The **purpose** of this notebook is to obtain the boosting model weights used in track 2.

In [1]:
import sys
sys.path.append('..')

import json
import pandas as pd
from pathlib import Path
from utils import read_config, RootPath
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split




Reading the config:

In [3]:
config = read_config(RootPath('config.yaml', must_exist=True))
data_paths = config['data']

### Features from the first track

First, we load the features that were created for the first track. They were computed
beforehand with the jarvis library and our scripts. In general they are:

    1. Jarvis descriptors computed on the defect atoms only
    2. Jarvis descriptors computed on the whole structures
    3. Graph features computed on the whole structures
    4. MegNet predictions

We load these extracted features for the train and private structures.

In [4]:
train_path_defects = RootPath(data_paths['public']['defects'], must_exist=True)
train_path_no_defects = RootPath(data_paths['public']['no_defects'], must_exist=True)

# loading targets
bandgaps = pd.read_csv(train_path_defects / "targets.csv").set_index('_id').sort_index()

# loading dumped data with jarvis descriptors extracted on defects only
train_defects_cfid = pd.read_csv(
    train_path_defects / "cfid" / "train.csv", index_col=0,
).set_index('_id').drop('band_gap', axis=1).add_prefix('defects_')

# jarvis descriptors on whole materials
train_no_defects_cfid = pd.read_csv(
    train_path_no_defects / "cfid" / "train.csv", index_col=0,
).set_index('_id').drop('band_gap', axis=1).add_prefix('no_defects_')

# graph features on whole materials
train_no_defects_graphs = pd.read_csv(
    train_path_no_defects / "graph" / "train.csv",
).set_index('_id').drop('band_gap', axis=1).add_prefix('graphs_')


Features for the private structures:

In [5]:
eval_path_defects = RootPath(data_paths['private']['defects'], must_exist=True)
eval_path_no_defects = RootPath(data_paths['private']['no_defects'], must_exist=True)

# loading dumped data with jarvis descriptors extracted on defects only
eval_defects_cfid = pd.read_csv(
    eval_path_defects / "cfid" / "eval.csv", index_col=0,
).add_prefix('defects_')

# jarvis descriptors on whole materials
eval_no_defects_cfid = pd.read_csv(
    eval_path_no_defects / "cfid" / "eval.csv", index_col=0,
).add_prefix('no_defects_')

# graph features on whole materials
eval_no_defects_graphs = pd.read_csv(
    eval_path_no_defects / "graph" / "eval.csv",
).set_index('_id').add_prefix('graphs_')


Merging the extracted structure descriptors into single dataframes:

In [6]:
train_dataframe = pd.concat([train_defects_cfid, train_no_defects_graphs, train_no_defects_cfid],
                            axis=1).sort_index()
eval_dataframe = pd.concat([eval_defects_cfid, eval_no_defects_graphs, eval_no_defects_cfid],
                           axis=1).sort_index()

Reading the predictions of MegNet trained on defects-only structures:

In [7]:
eval_preds = pd.read_csv(RootPath(
    Path(data_paths['private']['root']) / "megnet_private_predictions.csv", must_exist=True), index_col=0)
train_preds = pd.read_csv(RootPath(
    Path(data_paths['public']['root']) / "megnet_public_predictions.csv", must_exist=True), index_col=0)

# join MegNet predictions to the other features (left join on the index)
train_dataframe = train_dataframe.join(train_preds)
eval_dataframe = eval_dataframe.join(eval_preds)
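
The merge and join steps above rely on pandas aligning rows by index rather than by position. A minimal sketch of that behaviour on toy data follows; the tables `toy_defects`, `toy_graphs` and `toy_preds` and their values are hypothetical and only stand in for the real feature files.

```python
import pandas as pd

# Hypothetical toy tables indexed by `_id`; the row order differs on purpose.
toy_defects = pd.DataFrame(
    {"x": [1.0, 2.0]}, index=pd.Index(["a", "b"], name="_id")
).add_prefix("defects_")
toy_graphs = pd.DataFrame(
    {"x": [20.0, 10.0]}, index=pd.Index(["b", "a"], name="_id")
).add_prefix("graphs_")
toy_preds = pd.DataFrame(
    {"predictions": [0.5, 0.7]}, index=pd.Index(["b", "a"], name="_id")
)

# concat(axis=1) aligns rows on the index, so differing row order is harmless;
# the prefixes keep identically named columns from colliding.
toy_frame = pd.concat([toy_defects, toy_graphs], axis=1).sort_index()

# join is a left join on the index by default.
toy_frame = toy_frame.join(toy_preds)
print(toy_frame)
```

Because alignment is by `_id`, any `_id` present in one table but missing in another would show up as NaN rows, which is presumably why the length and index assertions are checked next.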


In [8]:
assert 'predictions' in train_dataframe
assert 'predictions' in eval_dataframe
assert len(train_dataframe) == 2966
assert len(eval_dataframe) == 2967
assert (train_dataframe.index == bandgaps.index).all()

### Feature selection

In [381]:
X_train, X_eval, y_train, y_eval = train_test_split(train_dataframe.drop('graphs_formula', axis=1),
                                                    bandgaps,
                                                    test_size=0.1, random_state=42)

train_pool = Pool(data=X_train, label=y_train, has_header=True)
eval_pool = Pool(data=X_eval, label=y_eval, has_header=True)

model = CatBoostRegressor(custom_metric='MAE', iterations=1000, depth=6, random_seed=42)

features = model.select_features(
    train_pool,
    eval_set=eval_pool,
    features_for_select=range(len(train_pool.get_feature_names())),
    num_features_to_select=500,
    steps=5,
    train_final_model=False,
)

Custom logger is already specified. Specify more than one logger at same time is not thread safe.

Learning rate set to 0.059406
Step #1 out of 5
0:	learn: 0.4787095	test: 0.4726250	best: 0.4726250 (0)	total: 35.1ms	remaining: 35s
1:	learn: 0.4509774	test: 0.4451110	best: 0.4451110 (1)	total: 58.8ms	remaining: 29.3s
2:	learn: 0.4249595	test: 0.4194356	best: 0.4194356 (2)	total: 96.5ms	remaining: 32.1s
3:	learn: 0.4006481	test: 0.3952922	best: 0.3952922 (3)	total: 122ms	remaining: 30.4s
4:	learn: 0.3776860	test: 0.3725751	best: 0.3725751 (4)	total: 144ms	remaining: 28.7s
5:	learn: 0.3559544	test: 0.3510230	best: 0.3510230 (5)	total: 162ms	remaining: 26.9s
6:	learn: 0.3354940	test: 0.3307071	best: 0.3307071 (6)	total: 177ms	remaining: 25.1s
7:	learn: 0.3165443	test: 0.3118646	best: 0.3118646 (7)	total: 190ms	remaining: 23.6s
8:	learn: 0.2987694	test: 0.2941692	best: 0.2941692 (8)	total: 230ms	remaining: 25.3s
9:	learn: 0.2818246	test: 0.2772571	best: 0.2772571 (9)	total: 274ms	remaining: 27.1s
10:	learn: 0.2659364	test: 0.2613903	best: 0.2613903 (10)	total: 310ms	remaining: 27.9s
11:	

In [None]:
importances = model.get_feature_importance(eval_pool, 'LossFunctionChange', prettified=True)
best_features = importances[importances['Importances'] > 0]['Feature Id'].values.tolist()


# Reuse the cached feature list if it exists; otherwise cache the freshly computed one.
features_path = Path("catboost_features.json")
if features_path.exists():
    with features_path.open("r") as f:
        best_features = json.load(f)
else:
    with features_path.open("w") as f:
        json.dump(best_features, f)
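
The cell above keeps features with a positive importance and caches the list on disk. A minimal, stdlib-only sketch of that pattern is shown below; the helper `load_or_create_feature_list`, the toy importance values and the file name `toy_features.json` are hypothetical and not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_feature_list(path, computed_features):
    """Return the cached list if `path` exists; otherwise cache and return `computed_features`."""
    path = Path(path)
    if path.exists():
        with path.open("r") as f:
            return json.load(f)
    features = list(computed_features)
    with path.open("w") as f:
        json.dump(features, f)
    return features


# Hypothetical (feature, importance) pairs standing in for the prettified importance table.
toy_importances = [
    ("defects_a", 0.12),
    ("graphs_b", 0.0),
    ("predictions", 0.85),
    ("no_defects_c", -0.01),
]
# Keep only the features whose importance is strictly positive.
toy_selected = [name for name, value in toy_importances if value > 0]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_features.json"
    first = load_or_create_feature_list(cache, toy_selected)
    # A second call with a different list still returns the cached one.
    second = load_or_create_feature_list(cache, ["something_else"])
```

Note the design consequence: once the cache file exists it takes precedence over any newly computed list, so the file has to be deleted to refresh the selection.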


### Learning

In [388]:
X_train, X_eval, y_train, y_eval = train_test_split(train_dataframe,
                                                    bandgaps,
                                                    test_size=0.1, random_state=42)

train_pool = Pool(data=X_train[best_features], label=y_train, has_header=True)
eval_pool = Pool(data=X_eval[best_features], label=y_eval, has_header=True)


model = CatBoostRegressor(custom_metric='MAE', iterations=1000, depth=6, random_seed=42)
model.fit(train_pool, verbose=100)

Learning rate set to 0.047813
0:	learn: 0.4845171	total: 4.78ms	remaining: 4.77s
100:	learn: 0.0164089	total: 200ms	remaining: 1.78s
200:	learn: 0.0118590	total: 382ms	remaining: 1.52s
300:	learn: 0.0101009	total: 674ms	remaining: 1.56s
400:	learn: 0.0093018	total: 875ms	remaining: 1.31s
500:	learn: 0.0086742	total: 1.11s	remaining: 1.11s
600:	learn: 0.0082807	total: 1.32s	remaining: 879ms
700:	learn: 0.0079227	total: 1.56s	remaining: 665ms
800:	learn: 0.0076472	total: 1.76s	remaining: 438ms
900:	learn: 0.0074349	total: 1.97s	remaining: 217ms
999:	learn: 0.0072327	total: 2.2s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x138665310>
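
The feature-selection cell and the training cell both call `train_test_split` with `random_state=42` and the same `test_size`, once with a column dropped and once without. A small sketch on hypothetical toy data (`toy_X`, `toy_y`) suggests why the two calls should still yield the same held-out rows: the shuffle depends on the number of rows and the seed, not on the columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, one non-numeric column as in `graphs_formula`.
toy_X = pd.DataFrame({"f1": range(20), "f2": range(20, 40), "formula": ["x"] * 20})
toy_y = pd.Series(range(20), name="band_gap")

# First split: one column dropped, as in the feature-selection cell.
_, X_eval_a, _, y_eval_a = train_test_split(
    toy_X.drop("formula", axis=1), toy_y, test_size=0.1, random_state=42
)
# Second split: all columns kept, as in the training cell.
_, X_eval_b, _, y_eval_b = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=42
)

# The held-out rows are identical in both splits.
same_rows = X_eval_a.index.equals(X_eval_b.index)
print(same_rows)
```

This holds only as long as the row count and row order of the input are unchanged between the two calls.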

Finally, saving the trained model:

In [352]:
model.save_model(RootPath(config['model']['boosting']))

and dumping the selected features for the private structures:

In [None]:
eval_dataframe[best_features].to_csv(RootPath(config['data']['private']['features']))