# Directed Graph Trick (not submitted)

Very basic trick: treat the edgelist as a set of directed arcs and pick the node with in-degree 0.\
We are not submitting this notebook, so as not to skew the results on Kaggle!

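Explanation: every non-root word has exactly one incoming arc (from its head), and the root has none. So picking the node with in-degree 0 recovers the root with 100% accuracy, even on the held-out split used in this notebook. This relies on the arcs in `edgelist` being stored in head-to-dependent order, which the accuracy figures in this notebook confirm for this dataset.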
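
As a quick illustration of the argument above, here is a minimal sketch on a made-up five-token sentence. The toy `edges` list is hypothetical and assumes each arc is written as `(head, dependent)`; it uses only the standard library rather than networkx.

```python
from collections import Counter

# Hypothetical toy sentence with 5 tokens; each arc is (head, dependent).
# Assumption: the edgelist stores arcs in head -> dependent order.
edges = [(2, 1), (2, 4), (4, 3), (4, 5)]

# Count incoming arcs per node.
in_degree = Counter(dependent for _, dependent in edges)
nodes = {n for arc in edges for n in arc}

# Nodes that never appear as a dependent have in-degree 0.
roots = sorted(n for n in nodes if in_degree[n] == 0)
print(roots)  # [2]
```

Every token except 2 appears exactly once as a dependent, so 2 is the only candidate for the root.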

In [20]:
import pandas as pd
import ast
import networkx as nx
from sklearn.model_selection import GroupShuffleSplit

In [21]:
df = pd.read_csv('../../data/train.csv')

In [22]:
# split by sentence to get a train/val hold-out
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(gss.split(df, groups=df['sentence']))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]

In [23]:
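# "in-degree 0" root predictor
def predict_root(edgelist):
    # accept either the raw CSV string or an already-parsed list of arcs
    edges = ast.literal_eval(edgelist) if isinstance(edgelist, str) else edgelist
    G = nx.DiGraph()
    G.add_edges_from(edges)
    # return the first node with no incoming arcs (unique in a well-formed tree)
    for n, d in G.in_degree():
        if d == 0:
            return n
    # no node with in-degree 0 (e.g. an empty edgelist)
    return None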
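
The predictor above returns the first node with in-degree 0 and takes uniqueness for granted. A small sketch of how that assumption could be checked is to return every candidate instead. The helper name `zero_in_degree_nodes` is hypothetical, and the code is standard-library only; it assumes arcs are `(head, dependent)` pairs.

```python
import ast

def zero_in_degree_nodes(edgelist):
    """Return every node with no incoming arc (hypothetical helper)."""
    # Accept either a string from the CSV or an already-parsed list of arcs.
    edges = ast.literal_eval(edgelist) if isinstance(edgelist, str) else edgelist
    nodes = {n for arc in edges for n in arc}
    dependents = {v for _, v in edges}
    return sorted(nodes - dependents)

# Well-formed directed tree: exactly one candidate.
print(zero_in_degree_nodes("[(2, 1), (2, 4), (4, 3)]"))  # [2]

# Arc orientation lost for one edge: two candidates, so the heuristic is ambiguous.
print(zero_in_degree_nodes([(1, 2), (4, 2), (4, 3)]))  # [1, 4]
```

Asserting that the returned list has length 1 on every row would flag any sentence where the direction of the arcs is not what the heuristic expects.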

In [24]:
# evaluate on TRAIN and VAL
for name, part in (("TRAIN", train_df), ("VAL", val_df)):
    preds = part['edgelist'].apply(predict_root)
    acc = (preds == part['root']).mean()
    print(f"{name} accuracy (in-degree=0 heuristic): {acc:.4f}")

TRAIN accuracy (in-degree=0 heuristic): 1.0000
VAL accuracy (in-degree=0 heuristic): 1.0000


In [25]:
test_df = pd.read_csv('../../data/test.csv')
test_df['edgelist'] = test_df['edgelist'].apply(ast.literal_eval)

In [17]:
test_df['root'] = test_df['edgelist'].apply(predict_root)

In [18]:
labeled_test = test_df[['id', 'root']]

In [19]:
labeled_test.to_csv('../../data/labeled_test.csv', index=False)
print(f"Wrote {len(labeled_test)} rows to ../../data/labeled_test.csv")

Wrote 10395 rows to ../../data/labeled_test.csv
