<a href="https://colab.research.google.com/github/soberbichler/Workshop_QualitativeDataResearch_LLM/blob/main/HF_Jobs_Setting_UP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Running LLM Jobs via HuggingFace

For background on Hugging Face Jobs, see https://huggingface.co/docs/huggingface_hub/guides/jobs



##Requirements for Hugging Face Jobs



*   Hugging Face Pro account - A paid subscription is required to access job creation features
*   Write access token - Generate a token with write permissions from your account settings
*   Valid payment method - Jobs consume compute credits based on usage


##Authentication Setup



*   Create your access token at huggingface.co/settings/tokens (you will be given an API token as part of the workshop)
*   Ensure the token has "Write" permissions enabled
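
One way to keep the token out of the notebook itself is to read it from an environment variable. The sketch below assumes the token has been exported as `HF_TOKEN`; that variable name and the helper `read_token` are illustrative choices, not part of the workshop material. The `hf_` prefix check reflects how Hugging Face user access tokens are currently formatted and is only a sanity check, not a validation of permissions.

```python
import os

def read_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or raise with a readable message.

    env_var is a hypothetical variable name chosen for this example.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; create a token at huggingface.co/settings/tokens"
        )
    if not token.startswith("hf_"):
        # Sanity check only: this does not verify that the token has write access
        raise ValueError(f"{env_var} does not look like a Hugging Face token")
    return token
```

The returned value could then be used wherever the script below expects "your_token".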


##Prepare your HF Job Script:

This script submits a remote job to Hugging Face's infrastructure that loads a language model and answers a question. It uses `run_job` to start a GPU-enabled Docker container (PyTorch with CUDA), installs the required Python packages (transformers, accelerate, etc.), and then runs a Python script inside the container. That script loads the chosen model, defines a question ("What is machine learning and how does it work?"), formats the question as a chat prompt, generates an answer with fixed sampling parameters (temperature 0.7, at most 500 new tokens), and prints both the question and the model's response. In short, it runs inference on remote hardware without requiring a local GPU: you submit the job, it runs on Hugging Face's servers, and the generated answer appears in the job logs.
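
The job's `command` is a `bash -c` string that embeds a whole Python script, so quoting is the fragile part. As a rough sketch of the "install packages, then run a script" pattern described above, the hypothetical helper `build_command` below assembles such a command with `shlex.quote` from the standard library. It is an illustration under that assumption, not the exact command used in this notebook.

```python
import shlex

def build_command(packages, script):
    """Return a ["bash", "-c", ...] command that installs packages, then runs script.

    build_command is a hypothetical helper written for this example.
    """
    # Quote each requirement so characters like ">" are not read as shell redirects
    install = "pip install -q " + " ".join(shlex.quote(p) for p in packages)
    # Quote the whole script so its own quotes survive the shell
    run = "python3 -c " + shlex.quote(script)
    return ["bash", "-c", f"{install} && {run}"]

command = build_command(
    ["transformers>=4.51.0", "accelerate"],
    'print("hello from the job")',
)
print(command[2])
```

A list built this way has the same shape as the `command` argument passed to `run_job` in the cell below.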

##Model

We are using the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B model in this environment.

LLM distillation is a technique in which a smaller "student" model learns to mimic the behavior of a larger, more powerful "teacher" model. Think of it as an expert teacher passing knowledge to a student, who becomes highly capable while remaining more efficient.

How it works:

*   **Teacher Model:** A large, powerful model (in this case, DeepSeek-R1) that performs very well but is expensive to run
*   **Student Model:** A smaller pretrained model (here, Qwen2.5-14B) that is further trained on the teacher's responses
*   **Knowledge Transfer:** The student model learns to produce outputs similar to the teacher's, capturing its "reasoning style" and capabilities
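
To make "learning to mimic the teacher" concrete, here is a toy sketch of the classic logit-matching form of distillation, in which the student is penalized by the KL divergence between temperature-softened teacher and student distributions. This illustrates the general idea only: as far as the published description goes, the DeepSeek-R1 distilled models were fine-tuned on text generated by the teacher rather than trained with this loss. The logits and the temperature value are made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits over three candidate tokens
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the two distributions match and grows as the student's preferences diverge from the teacher's, which is the signal a student would be trained to minimize.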

**For DeepSeek-R1-Distill-Qwen-14B specifically:**

*   Teacher: DeepSeek-R1 (671 billion parameters)
*   Student: Qwen2.5-14B (14 billion parameters, much smaller and faster)
*   Result: A 14B model that inherits much of DeepSeek-R1's reasoning ability but is far cheaper to run
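
The size difference can be put into numbers with a back-of-the-envelope estimate: memory for the weights alone is roughly the parameter count times the bits per parameter, divided by eight. The sketch below applies that rule to the parameter counts quoted above. It ignores activations, the KV cache, and quantization overhead, so treat the results as lower bounds; the helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

models = [
    ("DeepSeek-R1 (teacher)", 671e9),
    ("DeepSeek-R1-Distill-Qwen-14B (student)", 14e9),
]

for name, params in models:
    for bits in (16, 4):
        size = weight_memory_gb(params, bits)
        print(f"{name}: {bits}-bit weights need about {size:,.0f} GB")
```

At 4 bits per parameter, which is what the script below requests, the student's weights come to roughly 7 GB, which is why it fits on a single GPU.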




> ***You also need to add your HF token to the script. Search for "your_token" and replace it with your token.***



In [None]:
from huggingface_hub import run_job

job = run_job(
    image="pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    command=[
        "bash", "-c",
        """
        apt-get update && apt-get install -y wget &&
        pip install -q "transformers>=4.51.0" accelerate bitsandbytes huggingface_hub &&
        python3 -c "
import os, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from huggingface_hub import login

# CONFIGURATION
model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-14B'

# YOUR QUESTION HERE - CHANGE THIS!
QUESTION = 'What is machine learning and how does it work?'

# SETUP
hf_token = os.environ.get('HUGGINGFACE_TOKEN')
login(token=hf_token)

print('Loading model...')
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
tokenizer.pad_token = tokenizer.eos_token

# 4-bit quantization through an explicit config
# (passing load_in_4bit directly to from_pretrained is deprecated)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=quant_config,
    torch_dtype=torch.float16,
    token=hf_token
)
print('Model loaded successfully!')

# SIMPLE Q&A FUNCTION
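def ask_question(question):
    # Build the prompt with the chat template shipped with the tokenizer.
    # Hand-written Llama-3 header tokens do not match the format this model expects.
    messages = [
        {'role': 'system', 'content': 'You are a helpful AI assistant. Answer questions clearly and concisely.'},
        {'role': 'user', 'content': question},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # The template already inserts the special tokens, so do not add them a second time
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=2048, add_special_tokens=False).to(model.device)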
    input_length = inputs['input_ids'].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=500,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the generated part
    generated_tokens = outputs[0][input_length:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response.strip()

# ASK YOUR QUESTION AND GET ANSWER
print('\\n' + '='*60)
print(f'QUESTION: {QUESTION}')
print('='*60)

try:
    answer = ask_question(QUESTION)
    print(f'ANSWER: {answer}')
except Exception as e:
    print(f'Error: {str(e)}')

print('\\n' + '='*60)
print('Job complete!')
"
        """
    ],
    flavor="a100-large",
    secrets={"HUGGINGFACE_TOKEN": "your_token"}  # passed as a secret rather than a plain environment variable
)

print(f"Job submitted! ID: {job.id}")
print(f"Monitor at: {job.url}")

In [None]:
from huggingface_hub import inspect_job, fetch_job_logs
import time

# Poll job status until it's done
while True:
    status = inspect_job(job_id=job.id).status.stage
    print(f"Job status: {status}")
    if status in ("COMPLETED", "ERROR", "CANCELED", "DELETED"):  # stop on any terminal stage
        break
    time.sleep(10)

# Fetch logs after completion
print("\n=== Job logs ===")
logs = list(fetch_job_logs(job_id=job.id))
for line in logs:
    print(line)
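
The logs mix setup messages with the model's output, so it can help to pull out just the answer. The sketch below assumes each log entry is a plain string and relies on the `ANSWER:` marker and the `=` rule that the job script prints. It also assumes the model may wrap its reasoning in `<think>...</think>` tags, as DeepSeek-R1 style models typically do. Both helper names are hypothetical.

```python
def extract_answer(log_lines):
    """Return the text between the 'ANSWER:' marker and the next '=====' rule."""
    collected, capturing = [], False
    for line in log_lines:
        if not capturing and "ANSWER:" in line:
            capturing = True
            collected.append(line.split("ANSWER:", 1)[1].strip())
        elif capturing:
            if line.strip().startswith("=" * 10):
                break  # the rule printed before 'Job complete!' ends the answer
            collected.append(line.rstrip())
    return "\n".join(collected).strip()

def split_reasoning(answer, close_tag="</think>"):
    """Split an answer into (reasoning, final_text); reasoning is '' if no tag is found."""
    if close_tag not in answer:
        return "", answer.strip()
    reasoning, final = answer.split(close_tag, 1)
    return reasoning.replace("<think>", "").strip(), final.strip()
```

With the `logs` list from the cell above, `split_reasoning(extract_answer(logs))` would return the reasoning and the final answer separately. If generation stops at the 500-token limit before the closing tag, the whole text is returned as the final answer.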
