# Deep Past: GitHub Actions Cloud Workflow + TF-IDF Baseline

This notebook demonstrates a **cloud-first workflow** for Kaggle code competitions using **GitHub Actions + Kaggle API**.

## The Problem

- No local GPU
- Limited disk space
- Want to work from any device

## The Solution

```
Edit notebook locally → git push → GitHub Actions → kaggle kernels push → Submit via browser
```

Everything runs in the cloud. The only manual step is clicking "Submit to Competition" in the browser.

## Why Not Fully Automated?

The Kaggle API's `CreateCodeSubmission` endpoint returns **403 Forbidden** (`Permission 'kernelSessions.get' was denied`). This happens with both Legacy API Keys and the new API Tokens (KGAT_...), on CLI versions 1.8.4 and 2.0.0 (as of Feb 2026). The file-based route, `kaggle competitions submit`, is not an option either: it returns 400 for code competitions.

## Workflow Overview

| Step | Automated? | How |
|---|---|---|
| Edit notebook | - | Any device |
| Upload to Kaggle | **Yes** | GitHub Actions + `kaggle kernels push` |
| Submit | **Manual** | Browser: "Submit to Competition" |
| Check score | **Yes** | `kaggle competitions submissions` API |
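
The automated "Check score" step could be scripted roughly as follows. This is a sketch under assumptions, not the script used for this notebook: `submissions_command`, `parse_submissions_csv`, and `fetch_submissions` are hypothetical helper names, the `-v` (CSV output) flag is assumed to behave the same way in your installed CLI version, and the column names in the result are whatever the CLI prints.

```python
import csv
import io
import subprocess


def submissions_command(competition: str) -> list[str]:
    """Build the CLI call; -v asks the kaggle CLI for CSV instead of a table."""
    return ["kaggle", "competitions", "submissions", competition, "-v"]


def parse_submissions_csv(text: str) -> list[dict[str, str]]:
    """Turn CSV text into one dict per submission, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def fetch_submissions(competition: str) -> list[dict[str, str]]:
    """Run the CLI (needs Kaggle credentials in the environment) and parse stdout."""
    result = subprocess.run(
        submissions_command(competition),
        capture_output=True,
        text=True,
        encoding="utf-8",  # avoid locale-dependent decoding of the CLI output
        check=True,
    )
    return parse_submissions_csv(result.stdout)
```

Calling `fetch_submissions('deep-past-initiative-machine-translation')` in a scheduled job would then return the submission rows for inspection. If the CLI prints warnings ahead of the CSV header, the parsing step would need to skip them.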

## Key Gotchas

1. **`enable_internet` must be `false`**: a notebook with internet access enabled is not eligible for submission.
2. **Data path**: `competition_sources` mounts at `/kaggle/input/competitions/<slug>/`, NOT `/kaggle/input/<slug>/`.
3. **`kernels output` on Windows**: non-ASCII characters (such as Akkadian transliteration) cause cp932 encoding errors (cp932 is the default code page on Japanese-locale Windows), so fetch the output through the API directly instead of the CLI.
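
To guard against the data-path gotcha, a notebook can probe both layouts instead of hard-coding one. The helper below is a minimal sketch; `find_competition_dir` is a hypothetical name, and the two candidate paths are simply the ones named in gotcha 2.

```python
from pathlib import Path


def find_competition_dir(slug: str, root: Path = Path("/kaggle/input")) -> Path:
    """Return the first existing mount point for a competition's data."""
    candidates = [
        root / "competitions" / slug,  # layout observed in this notebook
        root / slug,                   # layout that is often assumed instead
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"No data found for {slug!r}; tried: {tried}")
```

With this in place, `DATA_DIR = find_competition_dir('deep-past-initiative-machine-translation')` fails with a readable message listing both paths rather than a bare `FileNotFoundError` from `read_csv`.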

---

Now let's build the actual baseline and generate a submission.

## Step 1: Verify Data Path

First, let's confirm where the competition data is mounted. This is a common gotcha.

In [None]:
from pathlib import Path

input_dir = Path('/kaggle/input')
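for item in sorted(input_dir.iterdir()):
    if not item.is_dir():
        # A stray top-level file has nothing to list underneath it.
        print(item.name)
        continue
    print(f'{item.name}/')
    for sub in sorted(item.iterdir()):
        # A byte size is only meaningful for files; directories just get a slash.
        if sub.is_dir():
            print(f'  {sub.name}/')
        else:
            print(f'  {sub.name} ({sub.stat().st_size:,} bytes)')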

## Step 2: Load Data

Note the path: `/kaggle/input/competitions/deep-past-initiative-machine-translation/`

In [None]:
import pandas as pd
import numpy as np

DATA_DIR = Path('/kaggle/input/competitions/deep-past-initiative-machine-translation')

train = pd.read_csv(DATA_DIR / 'train.csv')
test = pd.read_csv(DATA_DIR / 'test.csv')
sample_sub = pd.read_csv(DATA_DIR / 'sample_submission.csv')

print(f'Train: {train.shape}')
print(f'  Columns: {train.columns.tolist()}')
print(f'  Sample transliteration: {train["transliteration"].iloc[0][:80]}')
print(f'  Sample translation: {train["translation"].iloc[0][:80]}')
print()
print(f'Test: {test.shape}')
print(f'  Columns: {test.columns.tolist()}')
print()
print(f'Submission format: {sample_sub.shape}')
print(f'  Columns: {sample_sub.columns.tolist()}')

## Step 3: TF-IDF Nearest Neighbor Baseline

For this translation task (Akkadian → English), we use a simple approach:
1. Vectorize all transliterations using **character n-gram TF-IDF**
2. For each test row, find the **most similar** training transliteration
3. Use the corresponding English translation as our prediction

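Character n-grams work well for transliterated cuneiform, where the signs within a word are joined by hyphens: the n-grams capture sign fragments together with the hyphens between them, so two spellings that share most of their signs still overlap. The `char_wb` analyzer used here builds n-grams only inside whitespace-delimited words, so no n-gram spans two words.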
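
To see what word-bounded character n-grams look like on hyphenated text, here is a simplified pure-Python sketch. It is an illustration of the idea behind `char_wb`, not scikit-learn's own code; `char_wb_ngrams` is a hypothetical name and the input string is a made-up toy example.

```python
def char_wb_ngrams(text: str, min_n: int, max_n: int) -> list[str]:
    """Character n-grams built per whitespace-delimited word, padded with spaces."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "  # the padding marks word start and word end
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


grams = char_wb_ngrams("a-na um-ma", 2, 3)
print(grams)
# Hyphens appear inside n-grams, but no n-gram crosses the space between words.
```

The notebook's vectorizer uses the same idea with `ngram_range=(2, 5)`, then weights the n-gram counts with TF-IDF.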

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train['transliteration'] = train['transliteration'].fillna('')
train['translation'] = train['translation'].fillna('')
test['transliteration'] = test['transliteration'].fillna('')

vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(2, 5),
    max_features=50000,
    sublinear_tf=True
)

train_tfidf = vectorizer.fit_transform(train['transliteration'])
test_tfidf = vectorizer.transform(test['transliteration'])

print(f'Train TF-IDF matrix: {train_tfidf.shape}')
print(f'Test TF-IDF matrix: {test_tfidf.shape}')

## Step 4: Find Nearest Neighbors and Predict

In [None]:
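# Dense (n_test, n_train) matrix: one cosine similarity per test/train pair.
sims = cosine_similarity(test_tfidf, train_tfidf)
# For each test row, the index and score of its most similar training row.
best_idx = sims.argmax(axis=1)
best_scores = sims.max(axis=1)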

predictions = []
for i, (idx, score) in enumerate(zip(best_idx, best_scores)):
    pred = train['translation'].iloc[idx]
    predictions.append(pred)
    print(f'--- Test {i} (similarity: {score:.3f}) ---')
    print(f'  Test:  {test["transliteration"].iloc[i][:100]}')
    print(f'  Match: {train["transliteration"].iloc[idx][:100]}')
    print(f'  Pred:  {pred[:100]}')
    print()
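
For readers who want to see what the retrieval step computes, the sketch below reproduces it on toy data in plain Python. It assumes sparse vectors stored as `{feature: weight}` dicts; `cosine` and `nearest` are hypothetical helper names and the weights are invented for illustration, not taken from the competition data.

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either one is all zeros."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query: dict[str, float], candidates: list[dict[str, float]]) -> tuple[int, float]:
    """Return (index, score) of the most similar candidate; ties go to the first."""
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy "training" vectors and one "test" vector (made-up weights).
train_vecs = [{"a-": 1.0, "na": 1.0}, {"um": 1.0, "ma": 1.0}]
query_vec = {"um": 2.0, "ma": 1.0, "a-": 0.5}
idx, score = nearest(query_vec, train_vecs)
print(idx, round(score, 3))
```

The predicted translation is then whatever English text is stored alongside training row `idx`, which is exactly what the loop above does with `train['translation'].iloc[idx]`.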

## Step 5: Create Submission

In [None]:
submission = pd.DataFrame({
    'id': test['id'],
    'translation': predictions
})
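# Step 3 replaced missing translations with '', so fillna() alone would never
# trigger; map empty strings to the placeholder as well.
submission['translation'] = submission['translation'].fillna('').replace('', 'unknown')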

print(submission)
print()

submission.to_csv('/kaggle/working/submission.csv', index=False)
print(f'Saved submission.csv ({submission.shape})')
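
Because each submission costs a manual click, it can be worth checking the file against `sample_submission.csv` before leaving the notebook. The following is a sketch under assumptions: `check_submission` and `read_rows` are hypothetical names, the first column is assumed to hold the row id, and the sample file is assumed to define the expected header and ids.

```python
import csv
from pathlib import Path


def read_rows(path: Path) -> list[list[str]]:
    """Read a CSV file into a list of non-empty rows (header first)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f) if row]


def check_submission(submission_path: Path, sample_path: Path) -> None:
    """Raise ValueError if the submission does not line up with the sample file."""
    sub, sample = read_rows(submission_path), read_rows(sample_path)
    if not sub or not sample:
        raise ValueError("empty CSV file")
    if sub[0] != sample[0]:
        raise ValueError(f"header {sub[0]} != expected {sample[0]}")
    if len(sub) != len(sample):
        raise ValueError(f"{len(sub) - 1} rows != expected {len(sample) - 1}")
    # Assumption: the first column holds the row id; order is not enforced.
    if sorted(r[0] for r in sub[1:]) != sorted(r[0] for r in sample[1:]):
        raise ValueError("id values differ from the sample submission")
```

In this notebook the call would be `check_submission(Path('/kaggle/working/submission.csv'), DATA_DIR / 'sample_submission.csv')`, placed right after the file is written.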

## GitHub Actions Workflow

This notebook was pushed to Kaggle using the following GitHub Actions workflow:

```yaml
name: Kaggle Kernels Push

on:
  workflow_dispatch:
    inputs:
      notebook_dir:
        description: 'Notebook directory (e.g., deep-past)'
        required: true
        type: string

jobs:
  push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install kaggle CLI
        run: pip install kaggle

      - name: Push notebook to Kaggle
        env:
          KAGGLE_API_TOKEN: ${{ secrets.KAGGLE_API_TOKEN }}
        run: kaggle kernels push -p "${{ inputs.notebook_dir }}"
```

### kernel-metadata.json

```json
{
  "id": "your-username/your-kernel-slug",
  "title": "Your Title",
  "code_file": "your-notebook.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "is_private": "false",
  "enable_gpu": "false",
  "enable_internet": "false",
  "competition_sources": ["competition-slug"]
}
```

**Important**: `enable_internet` must be `"false"` for code competition submissions.
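
A small pre-push check can catch a wrong `enable_internet` setting before the notebook reaches Kaggle. The script below is a sketch, assuming it runs from the repository root against the notebook directory; `check_kernel_metadata` and `is_false` are hypothetical names, and treating a missing `enable_internet` key as a problem is a deliberately cautious assumption.

```python
import json
from pathlib import Path


def is_false(value) -> bool:
    """Accept both the JSON boolean false and the string "false"."""
    return value is False or str(value).strip().lower() == "false"


def check_kernel_metadata(notebook_dir: Path) -> list[str]:
    """Return a list of problems found in kernel-metadata.json (empty if none)."""
    meta_path = notebook_dir / "kernel-metadata.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    problems = []
    if not is_false(meta.get("enable_internet")):
        problems.append("enable_internet must be false for code competition submissions")
    if not meta.get("competition_sources"):
        problems.append("competition_sources is empty")
    code_file = meta.get("code_file")
    if not code_file or not (notebook_dir / code_file).is_file():
        problems.append(f"code_file not found: {code_file!r}")
    return problems
```

Run as an extra workflow step before `kaggle kernels push`, a non-empty result could fail the job early, for example by printing the problems and exiting with a non-zero status.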

## Full blog post

- [DEV.to](https://dev.to/yasumorishima/kaggle-code-competitions-without-a-local-gpu-github-actions-kaggle-api-cloud-workflow-m3)

If you found this useful, please upvote!