<a href="https://colab.research.google.com/github/soberbichler/Workshop_QualitativeDataResearch_LLM/blob/main/Analyze_Dataset_Huggingface_Jobs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running LLM Jobs via Hugging Face

For an introduction to Hugging Face Jobs, see https://huggingface.co/docs/huggingface_hub/guides/jobs



## Requirements for Hugging Face Jobs



*   Hugging Face Pro account - A paid subscription is required to access job creation features
*   Write access token - Generate a token with write permissions from your account settings
*   Valid payment method - Jobs consume compute credits based on usage


## Authentication Setup



*   Create your access token at huggingface.co/settings/tokens (you will be given an API token as part of the workshop)
*   Ensure the token has "Write" permissions enabled
*   Save your token as HF_Token in the Secrets panel (the key icon)
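
Once the token is stored as a secret, it can be read in code instead of being pasted into the notebook. The following is a minimal sketch, assuming the notebook runs in Colab and the secret is named `HF_Token` as described above; the helper name `read_hf_token` and the fallback to an environment variable of the same name are illustrative assumptions, not part of the workshop script.

```python
import os


def read_hf_token(secret_name="HF_Token"):
    """Return the token from Colab Secrets, else from an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(secret_name)
    except Exception:
        # Not in Colab, or the secret is not accessible from this notebook:
        # fall back to an environment variable with the same name
        return os.environ.get(secret_name)


hf_token = read_hf_token()
print("Token found" if hf_token else "No token found")
```

The returned value could then be passed to the job, for example as `env={"HUGGINGFACE_TOKEN": hf_token}`, so that the token never appears in the notebook text.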


# Analyze our Dataset

The dataset is already integrated into the script below. Can you find the position where this happens? Which column of the data is being used for the analysis? Where is the prompt and what model is being used?



> ***Answer those questions before you run the script (both cells). After running the script, fill in the model documentation while waiting!***

Model documentation: https://seafile.rlp.net/seafhttp/f/a5b34ec61267408da431/?op=view



### Run the script



> ***Don't forget to add your token in your script instead of "your_token"***





In [None]:
from huggingface_hub import run_job

job = run_job(
    image="pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    command=[
        "bash", "-c",
        """
        apt-get update && apt-get install -y wget &&
        pip install -q "transformers>=4.51.0" accelerate bitsandbytes huggingface_hub pandas &&
        wget -O SummerSchool_dataset.csv https://raw.githubusercontent.com/soberbichler/Workshop_QualitativeDataResearch_LLM/refs/heads/main/data/SummerSchool_dataset.csv &&
        python3 -c "
import os, torch, pandas as pd, datetime, re
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login, upload_file


# CONFIGURATION

model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-14B'

# YOUR PROMPT

SYSTEM_PROMPT = '''You are an expert at analyzing historical texts. You do not summarize.
OUTPUT FORMAT - EXACTLY these 4 XML tags and NOTHING else:
<argument>Original argument text OR NA</argument>
<claim>Core claim (implication) in one sentence OR NA</claim>
<explanation>Why this is an argument OR NA</explanation>
<human_verification_needed>True OR False</human_verification_needed>
EXAMPLE WITH ARGUMENT:
<argument>Es sind furchtbare Bilder, die sich dabei entrollen. Unter den Trümmern des einen Hause», so erzählt Luigt Barsint im Corrtcre della sera, findet man die Leichen von Unglück lichen, die in anderen Häusern gewohnt baben und die in der Ber- Wirrung de» schrcck.ichen Augenblickes instinktiv bet Fremden Hülfe und Unterschlupf suchten. Niemand erkennt jetzt diese armen Ein dringlinge, ihre Leichen werden nicht reklamiert, und man trägt sie hinunter an de» Strand, wo sie in langer Reihe einer neben den anderen hingebettet weiden, in denselben Tüchern und Decken, in denen sie tbren Tod gesunden.</argument>
<claim>The earthquake's chaos led to unidentified victims dying in unfamiliar places.</claim>
<explanation>Describes how people fled to other houses seeking help during the disaster, died there, and now cannot be identified or claimed by relatives. Shows cause (panic/confusion) and effect (anonymous deaths).</explanation>
<human_verification_needed>False</human_verification_needed>
EXAMPLE WITHOUT ARGUMENT:
<argument>NA</argument>
<claim>NA</claim>
<explanation>NA</explanation>
<human_verification_needed>False</human_verification_needed>
RULES:
- NO SUMMARY; ONLY ORIGINAL EXTRACTION FROM THE TEXT; don't extract anything that is not in the text. Only extract word by word
- ONLY output these 4 XML tags
- Factual reports are NOT arguments
- Extract only original text without changes or use NA when you did not find an argument
- The claim is not a translation or summary of the argument. It should say what the (implicit) argument implies
- In cases of uncertainty or ambiguity, set human_verification_needed to True
- If no argument exists, use NA for all fields except <human_verification_needed>False or True</human_verification_needed>
- More than one argumentative unit is possible for one article; each unit has one clear claim and all four XML tags
'''


# SETUP


hf_token = os.environ.get('HUGGINGFACE_TOKEN')
login(token=hf_token)

df = pd.read_csv('SummerSchool_dataset.csv', sep=';')
print(f'Dataset loaded with {len(df)} rows')


# LOAD MODEL

print('Loading model...')
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
tokenizer.pad_token = tokenizer.eos_token

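# 4-bit quantization is configured through BitsAndBytesConfig
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16
    ),
    torch_dtype=torch.float16,
    token=hf_token
)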
print('Model loaded successfully!')


# GENERATION FUNCTION


def generate_structured_response(model, tokenizer, text_to_analyze):
    '''Generate a response using the chat template of the loaded model'''

    # Instruction placed in front of the article text
    user_instruction = '''Extract argumentative units in its original form.
    Text to analyze:
    '''

    # Build the prompt with the chat template shipped with the tokenizer,
    # so the special tokens match the model instead of being hard-coded
    messages = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': user_instruction + text_to_analyze[:5500]},
    ]
    full_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # The template already inserts the special tokens, so they are not added a second time
    inputs = tokenizer(full_prompt, return_tensors='pt', truncation=True, max_length=5048, add_special_tokens=False).to(model.device)
    input_length = inputs['input_ids'].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=5000,
            # Greedy decoding: temperature and top_p only apply when do_sample=True
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            #repetition_penalty=1.3
        )

    # Decode ONLY the generated part
    generated_tokens = outputs[0][input_length:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Clean the response
    response = response.strip()

    # If model outputs the prompt again, extract just XML
    if '<|begin_of_text|>' in response or '<|start_header_id|>' in response:
        # Extract only the XML part
        xml_match = re.search(r'(<argument>.*?</human_verification_needed>)', response, re.DOTALL)
        if xml_match:
            response = xml_match.group(1)

    # Extract clean XML structure if it exists
    if '<argument>' in response and '</human_verification_needed>' in response:
        start_idx = response.find('<argument>')
        end_idx = response.rfind('</human_verification_needed>') + len('</human_verification_needed>')
        response = response[start_idx:end_idx]

    # Validate structure
    required_tags = ['<argument>', '</argument>', '<claim>', '</claim>',
                     '<explanation>', '</explanation>',
                     '<human_verification_needed>', '</human_verification_needed>']

    if not all(tag in response for tag in required_tags):
        # Incomplete output: return an empty unit and flag it for manual review
        print('  Warning: Using fallback')
        response = '''<argument>NA</argument>
<claim>NA</claim>
<explanation>NA</explanation>
<human_verification_needed>True</human_verification_needed>'''

    return response

def parse_structured_response(response):

    parsed = {}

    patterns = {
        'argument': r'<argument>(.*?)</argument>',
        'claim': r'<claim>(.*?)</claim>',
        'explanation': r'<explanation>(.*?)</explanation>',
        'verification_needed': r'<human_verification_needed>(.*?)</human_verification_needed>'
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, response, re.DOTALL)
        if match:
            parsed[key] = match.group(1).strip()
        else:
            parsed[key] = 'ERROR_MISSING'

    # Check if argument exists
    has_arg = (
        parsed['argument'] != 'NA' and
        parsed['argument'] != 'ERROR_MISSING' and
        len(parsed['argument']) > 5
    )
    parsed['has_argument'] = has_arg

    return parsed


# MAIN PROCESSING LOOP

results = []
structure_compliance = {'perfect': 0, 'failed': 0}

for idx, row in df.iterrows():
    # Check for missing values before converting to a string
    raw_text = row.get('extracted_articles', '')
    text = '' if pd.isna(raw_text) else str(raw_text)

    if text.strip() in ['nan', '']:
        print(f'Row {idx}: empty, skipping')
        continue

    print(f'Processing row {idx}...')

    try:
        # Generate response
        response = generate_structured_response(model, tokenizer, text)

        # Parse response
        parsed = parse_structured_response(response)

        # Check structure compliance
        if all(parsed[k] != 'ERROR_MISSING' for k in ['argument', 'claim', 'explanation', 'verification_needed']):
            structure_compliance['perfect'] += 1
            print(f'  ✓ Perfect structure')
        else:
            structure_compliance['failed'] += 1
            print(f'  ✗ Failed structure')

        # Store results - USING PARSED VALUES CORRECTLY
        result_row = row.to_dict()
        result_row['llm_raw_response'] = response[:5000]
        result_row['argument'] = parsed['argument']
        result_row['claim'] = parsed['claim']
        result_row['explanation'] = parsed['explanation']
        result_row['human_verification_needed'] = parsed['verification_needed']
        result_row['has_argument'] = parsed['has_argument']
        result_row['processed_row_index'] = idx
        result_row['model_used'] = model_name
        results.append(result_row)

        # Show result
        if parsed['has_argument']:
            arg_text = parsed['argument']
            preview = arg_text[:50] + '...' if len(arg_text) > 50 else arg_text
            print(f'  Found: {preview}')
        else:
            print(f'  No argument (NA)')

    except Exception as e:
        print(f'  Error: {str(e)}')
        result_row = row.to_dict()
        result_row['llm_raw_response'] = 'ERROR'
        result_row['argument'] = str(e)
        result_row['claim'] = 'ERROR'
        result_row['explanation'] = 'ERROR'
        result_row['human_verification_needed'] = 'True'
        result_row['has_argument'] = False
        result_row['processed_row_index'] = idx
        result_row['model_used'] = model_name
        results.append(result_row)


# COMPLIANCE REPORT


total = sum(structure_compliance.values())
if total > 0:
    print('\\n' + '='*50)
    perf = structure_compliance['perfect']
    fail = structure_compliance['failed']
    print(f'STRUCTURE COMPLIANCE:')
    print(f'  Perfect: {perf}/{total} ({perf/total*100:.1f}%)')
    print(f'  Failed: {fail}/{total} ({fail/total*100:.1f}%)')
    print('='*50)


# SAVE AND UPLOAD


output_df = pd.DataFrame(results)
output_df.to_csv('output_analysis_structured.csv', index=False)
print(f'\\nSaved {len(output_df)} results to CSV')


# Upload to Hugging Face
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
filename = f'llm_structured_results_{timestamp}.csv'

try:
    print(f'\\nUploading as {filename}...')
    upload_file(
        path_or_fileobj='output_analysis_structured.csv',
        path_in_repo=filename,
        repo_id='oberbics/jobs',
        repo_type='dataset',
        token=hf_token,
        commit_message=f'Structured LLM analysis - {timestamp}'
    )
    print(f'\\n✅ SUCCESS! File uploaded: {filename}')
    print(f'📥 Download: https://huggingface.co/datasets/oberbics/jobs/resolve/main/{filename}')
except Exception as e:
    print(f'\\n❌ Upload failed: {str(e)}')

print(f'\\n🎯 Job complete! Processed {len(output_df)} rows')
"
        """
    ],
    flavor="a100-large",
    env={"HUGGINGFACE_TOKEN": "your_token"}
)

print(f"Job submitted! ID: {job.id}")
print(f"Monitor at: {job.url}")

# Monitor the job and get the results!

Open the CSV file, copy its contents into a .txt file, open that file in Excel, and save the results. To open it in Excel, create a new file, go to the "Data" tab, and click "From Text/CSV".
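
As an alternative to Excel, the results can also be inspected with pandas. The sketch below is illustrative only: it uses a small, made-up in-memory sample with some of the column names written by the job script (`argument`, `claim`, `explanation`, `human_verification_needed`, `has_argument`), and it assumes the downloaded file is comma-separated, which is the default of `to_csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics columns written by the job script;
# replace the StringIO object with the path of the downloaded CSV file.
sample_csv = io.StringIO(
    "argument,claim,explanation,human_verification_needed,has_argument\n"
    "Some quoted passage,Some claim,Some explanation,True,True\n"
    "NA,NA,NA,False,False\n"
)

# keep_default_na=False keeps the literal string NA instead of turning it into NaN
results = pd.read_csv(sample_csv, dtype=str, keep_default_na=False)

# Normalise the flag columns, since the model may answer True, TRUE or true
needs_check = results["human_verification_needed"].str.strip().str.lower() == "true"
has_argument = results["has_argument"].str.strip().str.lower() == "true"

print(f"{len(results)} rows, {has_argument.sum()} with an argument")
print(results.loc[needs_check, ["argument", "claim"]])
```

Reading every column as a string avoids pandas silently converting `NA` to a missing value, which would make rows without an argument harder to tell apart from rows that failed.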


In [None]:
from huggingface_hub import inspect_job, fetch_job_logs
import time

# Poll job status until it reaches a terminal stage
while True:
    status = inspect_job(job_id=job.id).status.stage
    print(f"Job status: {status}")
    if status in ("COMPLETED", "ERROR", "CANCELED", "DELETED"):
        break
    time.sleep(10)

# Fetch logs after completion
print("\n=== Job logs ===")
logs = list(fetch_job_logs(job_id=job.id))
for line in logs:
    print(line)
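
The system prompt allows more than one argumentative unit per article, while the job script keeps only the first match of each tag in the parsed columns. Since the full answer is stored in `llm_raw_response`, it can be re-parsed after downloading the results. The following is a rough sketch under the assumption that the model emits the four tags in the requested order; `parse_all_units` and the example response are hypothetical and not part of the workshop script.

```python
import re

# One unit = the four tags in the order requested by the system prompt
UNIT_PATTERN = re.compile(
    r"<argument>(.*?)</argument>\s*"
    r"<claim>(.*?)</claim>\s*"
    r"<explanation>(.*?)</explanation>\s*"
    r"<human_verification_needed>(.*?)</human_verification_needed>",
    re.DOTALL,
)


def parse_all_units(response):
    """Return a list of dicts, one per argumentative unit found in the response."""
    keys = ("argument", "claim", "explanation", "human_verification_needed")
    return [
        dict(zip(keys, (part.strip() for part in match)))
        for match in UNIT_PATTERN.findall(response)
    ]


# Hypothetical response containing two units
example = """<argument>First passage</argument>
<claim>First claim</claim>
<explanation>First reason</explanation>
<human_verification_needed>False</human_verification_needed>
<argument>Second passage</argument>
<claim>Second claim</claim>
<explanation>Second reason</explanation>
<human_verification_needed>True</human_verification_needed>"""

for unit in parse_all_units(example):
    print(unit["claim"], "|", unit["human_verification_needed"])
```

Responses in which tags are missing or out of order are not handled by this pattern and would still need manual review.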
