## Feature Ideas

This task overlaps heavily with `engage` conceptually, so we reuse most of the engage features
and add a few badge-specific ones.

- **Engage feature set** (one row per user and timestamp; some unimportant features omitted):
    - User-level features:
        - `userId`
        - `timestamp`
        - `months_since_account_creation`
        - `num_badges`
        - `weeks_since_last_badge`
        - `badge_score`: sum of `log(1 / badge_incidence)` over the user's badges, i.e. a badge count weighted by rarity. E.g., a user with two badges, one held by 5% of users and one held by 1% of users, has a badge_score of log(20) + log(100)
    - Question feats:
        - last_question (feats as of the last question asked by user, NULL if not exists):
            - `weeks_ago`: how many weeks ago the user posted their last question
            - `num_tags`
            - `body_length`
            - `avg_commenter_badge_score`
        - questions_last_yr (aggregate feats over questions asked by user in the last year, NULL otherwise):
            - `num_questions_last_6mo`
            - `avg_has_accepted_ans`
            - `avg_num_tags`
            - `avg_body_length`
            - `avg_num_positive_votes`
            - `avg_num_negative_votes`
            - `avg_num_comments`
            - `avg_commenter_badge_score`
    - Answer feats:
        - last_answer (feats as of the last answer posted by user, NULL if not exists):
            - `weeks_ago`: how many weeks ago the user posted their last answer
            - `body_length`
            - `avg_commenter_badge_score`
        - answers_last_yr (aggregate feats over answers posted by user in the last year, NULL otherwise):
            - `num_answers_last_3mo`
            - `acceptance_rate`
            - `avg_body_length`
            - `avg_num_positive_votes`
            - `avg_num_negative_votes`
            - `avg_num_comments`
            - `avg_commenter_badge_score`
    - Extra badge features:
        - `last_badge_rarity`: rarity as defined in badge score, i.e. `log(1 / badge_incidence)`
        - `max_rarity`
        - `rarest_badge_weeks_ago`
        - `avg_badge_age`
        - `avg_time_bw_badges`
        - `avg_badge_rarity`
        - `badge_momentum`: an exponential moving average of badge rarity, so a recent rare badge counts for more than an old rare badge
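
To make the rarity-based features concrete, here is a minimal sketch of `badge_score` and `badge_momentum` in plain Python. It is an illustration under assumptions, not the logic in `badges/feats.sql`: the natural log, the smoothing factor `alpha`, the EMA starting value of 0, and updating once per badge (rather than per unit of time) are all choices made for this example.

```python
import math


def badge_rarity(incidence: float) -> float:
    """Rarity of a badge held by a fraction `incidence` of users: log(1 / incidence)."""
    return math.log(1.0 / incidence)


def badge_score(incidences) -> float:
    """Sum of rarities over all badges a user holds."""
    return sum(badge_rarity(p) for p in incidences)


def badge_momentum(rarities_oldest_first, alpha: float = 0.5) -> float:
    """Exponential moving average of badge rarity in chronological order.

    `alpha` and the initial value 0.0 are assumptions for this sketch.
    """
    ema = 0.0
    for rarity in rarities_oldest_first:
        ema = alpha * rarity + (1.0 - alpha) * ema
    return ema


# The example from the list: one badge held by 5% of users, one by 1%.
score = badge_score([0.05, 0.01])  # log(20) + log(100)

# Same two badges in a different order: common then rare vs. rare then common.
recent_rare = badge_momentum([badge_rarity(0.5), badge_rarity(0.01)])
old_rare = badge_momentum([badge_rarity(0.01), badge_rarity(0.5)])
print(score, recent_rare, old_rare)
```

With these assumptions, `recent_rare` comes out larger than `old_rare`, which is the ordering the momentum feature is meant to capture.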

In [1]:
cd ..

/lfs/hyperion/0/adobles/relbench-user-study/stack_exchange


In [2]:
import duckdb
import numpy as np
from relbench.datasets import get_dataset
from torch_frame import TaskType, stype
from torch_frame.gbdt import LightGBM, XGBoost
from torch_frame.data import Dataset
from torch_frame.typing import Metric
from torch_frame.utils import infer_df_stype

import utils

conn = duckdb.connect('stackex.db')
%load_ext sql
%sql conn --alias duckdb
%config SqlMagic.displaycon=False

In [3]:
%%sql
select 
    avg(WillGetBadge),
    count(*)
from badges_train

avg(WillGetBadge),count_star()
0.2995415831663326,399200


In [4]:
%%sql
select
    badges.UserId,
    first(badges.Name order by badges.Date asc) as first_badge,
    last(badges.Name order by badges.Date asc) as last_badge
from badges
group by badges.UserId
limit 10

UserId,first_badge,last_badge
40870,Autobiographer,Teacher
7367,Student,Popular Question
40878,Student,Editor
16986,Student,Yearling
40880,Student,Scholar
40864,Editor,Yearling
40882,Student,Nice Question
40883,Informed,Autobiographer
14657,Scholar,Yearling
39075,Student,Popular Question


## Tuning

In [25]:
with open('badges/feats.sql', 'r') as f:
    # Jinja SQL template; rendered once per split in the loop below
    template = f.read()

# create train, val and test feature tables
# takes 1 - 5 mins
for s in ['train', 'val', 'test']:
    print(f'Creating {s} table')
    query = utils.render_jinja_sql(template, dict(set=s))
    conn.sql(query)
    print(f'{s} table created')

Creating train table
train table created
Creating val table
val table created
Creating test table
test table created
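
The loop above renders one SQL template three times, once per split. As a rough sketch of what that substitution amounts to, the snippet below uses a hypothetical helper, `render_placeholders`, that only replaces `{{ name }}` placeholders with the standard library. It is not `utils.render_jinja_sql` and does not support Jinja control flow; the template string is also made up for illustration, since the contents of `badges/feats.sql` are not shown here.

```python
import re


def render_placeholders(template: str, params: dict) -> str:
    """Replace `{{ name }}` placeholders with values from `params`.

    Hypothetical stand-in: handles simple variable substitution only.
    """
    def substitute(match: re.Match) -> str:
        return str(params[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Illustrative template, not the real feats.sql.
example_template = (
    "create or replace table badges_{{ set }}_feats as "
    "select * from badges_{{ set }}"
)

rendered = {
    s: render_placeholders(example_template, dict(set=s))
    for s in ["train", "val", "test"]
}
for s, query in rendered.items():
    print(s, "->", query)
```

Keeping the split name as the only template parameter means the train, val and test tables are built by identical feature logic, which helps avoid train/test skew.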


In [26]:
train_df = conn.sql('select * from badges_train_feats').df()
val_df = conn.sql('select * from badges_val_feats').df()

In [27]:
col_to_stype = infer_df_stype(train_df)
col_to_stype

{'user_id': <stype.numerical: 'numerical'>,
 'timestamp': <stype.timestamp: 'timestamp'>,
 'WillGetBadge': <stype.categorical: 'categorical'>,
 'months_since_account_creation': <stype.categorical: 'categorical'>,
 'num_badges': <stype.numerical: 'numerical'>,
 'badge_score': <stype.numerical: 'numerical'>,
 'max_rarity': <stype.numerical: 'numerical'>,
 'avg_rarity': <stype.numerical: 'numerical'>,
 'rarest_badge_age_weeks': <stype.categorical: 'categorical'>,
 'last_badge_rarity': <stype.numerical: 'numerical'>,
 'last_badge_weeks_ago': <stype.categorical: 'categorical'>,
 'avg_badge_age_weeks': <stype.numerical: 'numerical'>,
 'avg_weeks_bw_badges': <stype.numerical: 'numerical'>,
 'badge_momentum': <stype.numerical: 'numerical'>,
 'weeks_since_last_comment': <stype.categorical: 'categorical'>,
 'num_comments': <stype.numerical: 'numerical'>,
 'num_posts_commented': <stype.numerical: 'numerical'>,
 'avg_comment_length': <stype.numerical: 'numerical'>,
 'last_q_weeks_ago': <stype.cate

In [28]:
DROP_COLS = [
    # drop identifier cols
    'user_id',
    'timestamp',
]
for c in DROP_COLS:
    del col_to_stype[c]
# Override the inferred stype for integer-valued columns that should be numerical
col_to_stype['months_since_account_creation'] = stype.numerical
col_to_stype['rarest_badge_age_weeks'] = stype.numerical
col_to_stype['last_badge_weeks_ago'] = stype.numerical
col_to_stype['weeks_since_last_comment'] = stype.numerical
col_to_stype['last_q_weeks_ago'] = stype.numerical
col_to_stype['last_q_num_tags'] = stype.numerical

In [29]:
train_dset = Dataset(
    train_df.drop(DROP_COLS, axis=1),
    col_to_stype=col_to_stype,
    target_col='WillGetBadge'
).materialize()
val_tf = train_dset.convert_to_tensor_frame(val_df.drop(DROP_COLS, axis=1))
tune_metric = Metric.ROCAUC
print(train_dset.tensor_frame.num_cols, train_dset.tensor_frame.num_rows)

43 399200
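
Since only about 30% of the training rows are positive, plain accuracy would reward always predicting "no badge", which is one reason to tune on ROC AUC instead. The sketch below computes ROC AUC directly from its pairwise definition (the probability that a random positive is scored above a random negative, with ties counted as half). It is a quadratic-time illustration on toy data, not how `torch_frame` or scikit-learn compute the metric.

```python
def roc_auc(labels, scores) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration.
toy_labels = [1, 0, 0, 1, 0]
constant_auc = roc_auc(toy_labels, [0.3] * 5)  # a constant predictor
ranked_auc = roc_auc(toy_labels, [0.9, 0.2, 0.4, 0.3, 0.1])
print(constant_auc, ranked_auc)
```

A constant predictor scores 0.5 regardless of the class balance, so any tuned model should be judged against that baseline rather than against the 70% accuracy of always predicting the majority class.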


In [30]:
gbdt = XGBoost(TaskType.BINARY_CLASSIFICATION, num_classes=2, metric=tune_metric)
gbdt.tune(tf_train=train_dset.tensor_frame, tf_val=val_tf, num_trials=10, device='cuda')

[I 2024-04-22 21:58:25,188] A new study created in memory with name: no-name-e211034d-eee6-42ce-8403-8fbdc5e8ad5a



In [None]:
gbdt.save('data/badges.xgb')

## Val Eval

In [None]:
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, average_precision_score, accuracy_score, PrecisionRecallDisplay