# Feature Engineering with tags

According to [this discussion](https://www.kaggle.com/c/jane-street-market-prediction/discussion/198965#1088878), `features.csv` may be a key to understanding the
> relationship between the anonymized features

In this notebook, I propose grouping the roughly 130 features in the training data into clusters based on the 29 tags (`tag_0` to `tag_28`).

In [None]:
import numpy as np
import pandas as pd
import os, gc
import datatable as dt
import pickle
from tqdm.notebook import tqdm
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.core.common.SettingWithCopyWarning)

## What does `features.csv` represent?

This file does not contain feature values itself. Instead, each of its 29 tags (`tag_0` to `tag_28`) defines a group of related features among `feature_1` to `feature_129` in the training data. Each row corresponds to one feature, and each tag column flags whether that feature belongs to the tag.

In [None]:
tag_feats = pd.read_csv('../input/jane-street-market-prediction/features.csv')
tag_feats.head()
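
To make the layout concrete, here is a minimal sketch on a toy table that mimics the shape of `features.csv`. The frame `toy_tags` and its values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for features.csv: one row per feature, one boolean column per tag.
# These values are made up for illustration only.
toy_tags = pd.DataFrame({
    'feature': ['feature_1', 'feature_2', 'feature_3'],
    'tag_0': [True, False, True],
    'tag_1': [False, True, True],
})

tag_cols = [c for c in toy_tags.columns if c.startswith('tag_')]

# Number of features flagged by each tag
features_per_tag = toy_tags[tag_cols].sum()

# Mapping from tag name to the list of features it groups
tag_to_features = {
    tg: toy_tags.loc[toy_tags[tg], 'feature'].tolist()
    for tg in tag_cols
}

print(features_per_tag)
print(tag_to_features)
```

The same two lookups should work on the real `tag_feats` frame, assuming its tag columns are boolean as they are here.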

This notebook builds tag-level features from the information in the file above.

In [None]:
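%%time
train = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()

# keep only rows with non-zero weight; .copy() avoids SettingWithCopyWarning on the assignment below
train_data = train[train['weight'] != 0].copy()
# binary target: 1 when `resp` is positive, 0 otherwise
train_data['action'] = (train_data['resp'].values > 0).astype('int')

# fill NaN values in the train dataset with each column's mean
f_mean = train_data.mean()
train_data = train_data.fillna(f_mean)

# release the raw frame to free memory
del train
gc.collect()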
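
The preprocessing above (filter on weight, derive `action`, fill NaN with column means) can be checked on a small example. This is only a sketch: the frame `toy` and its values are hypothetical and merely reuse the column names from the training data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training data (values invented for illustration)
toy = pd.DataFrame({
    'weight': [0.0, 1.0, 2.0, 3.0],
    'resp': [0.5, -0.1, 0.2, 0.3],
    'feature_1': [9.0, 1.0, np.nan, 3.0],
})

# Drop zero-weight rows; .copy() gives an independent frame to modify
toy_data = toy[toy['weight'] != 0].copy()

# Binary target: 1 when resp is positive, 0 otherwise
toy_data['action'] = (toy_data['resp'] > 0).astype('int')

# DataFrame.mean() skips NaN by default, so the means come from observed values only
col_means = toy_data.mean()
toy_data = toy_data.fillna(col_means)

print(toy_data)
```

Note that the means are computed after the zero-weight rows are dropped, so the filtered-out row does not influence the imputed value.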

## Set feature & tag labels

In [None]:
features = [f'feature_{x}' for x in range(1,130)]
tags = [f'tag_{x}' for x in range(29)]

## Tag clustering with features
- build the feature list of each tag from `features.csv`
- calculate the mean of each tag's features

In [None]:
%%time

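for tg in tqdm(tags):
    # features flagged True for this tag in features.csv
    feat_list = tag_feats.loc[tag_feats[tg] == True, 'feature'].values
    # the row-wise mean of those features becomes the new tag feature
    train_data[tg] = train_data.loc[:, feat_list].mean(axis=1)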
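
As a sanity check, the per-tag mean can be reproduced on toy data and compared against a matrix formulation. This is a sketch under assumptions: `toy_tags` and `toy_train` are invented frames, and the matrix version assumes there are no NaN values left (as after the mean fill) and that every tag flags at least one feature.

```python
import numpy as np
import pandas as pd

# Invented tag table and training rows, for illustration only
toy_tags = pd.DataFrame({
    'feature': ['feature_1', 'feature_2', 'feature_3'],
    'tag_0': [True, False, True],
    'tag_1': [False, True, True],
})
toy_train = pd.DataFrame({
    'feature_1': [1.0, 2.0],
    'feature_2': [3.0, 4.0],
    'feature_3': [5.0, 6.0],
})
toy_features = toy_tags['feature'].tolist()
toy_tag_cols = ['tag_0', 'tag_1']

# Loop version, mirroring the cell above
loop_means = pd.DataFrame(index=toy_train.index)
for tg in toy_tag_cols:
    feat_list = toy_tags.loc[toy_tags[tg], 'feature'].values
    loop_means[tg] = toy_train.loc[:, feat_list].mean(axis=1)

# Matrix version: (rows x features) @ (features x tags), divided by features per tag
membership = (
    toy_tags.set_index('feature')
    .loc[toy_features, toy_tag_cols]
    .to_numpy(dtype=float)
)
matrix_means = toy_train[toy_features].to_numpy() @ membership / membership.sum(axis=0)

print(loop_means)
print(matrix_means)
```

Both versions give the same numbers here. On the full training data the matrix form may be faster, but the loop is easier to read and handles NaN values through `mean(axis=1)`.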

### If this notebook is useful, even just a little, please upvote.