# Classification Tutorial with BilbyStats

This tutorial shows how to perform text classification with the BilbyStats library. It covers a traditional machine learning baseline, transformer-based models, and LLM-based classification.

## Table of Contents

1. [Setup and Data Loading](#setup)
2. [Data Preparation](#data-prep)
3. [Traditional ML Classification](#traditional-ml)
4. [Transformer-Based Classification](#transformer)
5. [LLM-Based Classification](#llm)
6. [Model Evaluation and Comparison](#evaluation)
7. [Advanced Topics](#advanced)

## 1. Setup and Data Loading {#setup}

First, let's import the necessary libraries and load our dataset.

In [1]:
import pandas as pd
import numpy as np
import bilbystats as bs
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Load the gold dataset for financial sentiment analysis
df = bs.read_data('gold-dataset-sinha-khandait.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()

Dataset shape: (10570, 10)
Columns: ['Dates', 'URL', 'News', 'Price Direction Up', 'Price Direction Constant', 'Price Direction Down', 'Asset Comparision', 'Past Information', 'Future Information', 'Price Sentiment']

First few rows:


Unnamed: 0,Dates,URL,News,Price Direction Up,Price Direction Constant,Price Direction Down,Asset Comparision,Past Information,Future Information,Price Sentiment
0,28-01-2016,http://www.marketwatch.com/story/april-gold-do...,"april gold down 20 cents to settle at $1,116.1...",0,0,1,0,1,0,negative
1,13-09-2017,http://www.marketwatch.com/story/gold-prices-s...,gold suffers third straight daily decline,0,0,1,0,1,0,negative
2,26-07-2016,http://www.marketwatch.com/story/gold-futures-...,Gold futures edge up after two-session decline,1,0,0,0,1,0,positive
3,28-02-2018,https://www.metalsdaily.com/link/277199/dent-r...,dent research : is gold's day in the sun comin...,0,0,0,0,0,1,none
4,06-09-2017,http://www.marketwatch.com/story/gold-steadies...,"Gold snaps three-day rally as Trump, lawmakers...",0,0,1,0,1,0,negative


In [2]:
# Examine the distribution of our target variable
print("Target variable distribution:")
print(df['Price Direction Up'].value_counts())

# Look at some example texts
print("\nSample news text:")
for i in range(3):
    print(f"Text {i+1}: {df['News'].iloc[i]}")
    print(f"Label: {df['Price Direction Up'].iloc[i]}\n")

Target variable distribution:
Price Direction Up
0    6158
1    4412
Name: count, dtype: int64

Sample news text:
Text 1: april gold down 20 cents to settle at $1,116.10/oz
Label: 0

Text 2: gold suffers third straight daily decline
Label: 0

Text 3: Gold futures edge up after two-session decline
Label: 1



## 2. Data Preparation {#data-prep}

BilbyStats provides convenient functions for splitting data and preparing it for different types of models.

In [3]:
# For this tutorial, let's use a subset for faster training
df_subset = df.head(2000)  # Adjust size based on your computational resources

# Define our input text column and target variable
covariate = 'News'
target = 'Price Direction Up'

# Split indices into train/validation/test sets
indices = bs.data_idx_split(df_subset.index, ratio=0.3, valratio=0.5, random_state=42)

print(f"Training samples: {len(indices['train'])}")
print(f"Validation samples: {len(indices['valid'])}")
print(f"Test samples: {len(indices['test'])}")

# Create train/validation/test datasets for transformer models
train_data, valid_data, test_data = bs.train_val_test_split(
    df_subset, covariate, target, indices
)

print(f"\nTraining data shape: {len(train_data)}")
print(f"Validation data shape: {len(valid_data)}")
print(f"Test data shape: {len(test_data)}")

Training samples: 1400
Validation samples: 300
Test samples: 300

Training data shape: 1400
Validation data shape: 300
Test data shape: 300
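
The 1400/300/300 split is consistent with `ratio=0.3` holding out 30% of the rows and `valratio=0.5` dividing that held-out part evenly between validation and test. That reading is inferred from the output above, not from the BilbyStats documentation. A minimal standard-library sketch of the same arithmetic, where `split_indices` is a hypothetical helper and not part of BilbyStats:

```python
import random

def split_indices(indices, ratio=0.3, valratio=0.5, seed=42):
    """Hypothetical helper: hold out `ratio` of the rows, then give
    `valratio` of the held-out rows to validation and the rest to test."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * ratio)
    n_valid = round(n_holdout * valratio)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    return {"train": train, "valid": holdout[:n_valid], "test": holdout[n_valid:]}

# 2000 rows, matching the size of df_subset
sizes = {name: len(idx) for name, idx in split_indices(range(2000)).items()}
print(sizes)
```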


## 3. Traditional ML Classification {#traditional-ml}

Before jumping to deep learning, let's establish a baseline using traditional machine learning approaches.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Extract texts and labels
train_texts = [train_data[i]['text'] for i in range(len(train_data))]
train_labels = [train_data[i]['label'] for i in range(len(train_data))]

valid_texts = [valid_data[i]['text'] for i in range(len(valid_data))]
valid_labels = [valid_data[i]['label'] for i in range(len(valid_data))]

test_texts = [test_data[i]['text'] for i in range(len(test_data))]
test_labels = [test_data[i]['label'] for i in range(len(test_data))]

# Fit vectorizer and transform texts
X_train = vectorizer.fit_transform(train_texts)
X_valid = vectorizer.transform(valid_texts)
X_test = vectorizer.transform(test_texts)

# Train logistic regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, train_labels)

# Make predictions
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(test_labels, lr_predictions)

print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(test_labels, lr_predictions))

Logistic Regression Accuracy: 0.7967

Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.92      0.83       166
           1       0.87      0.64      0.74       134

    accuracy                           0.80       300
   macro avg       0.81      0.78      0.79       300
weighted avg       0.81      0.80      0.79       300
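
To put the 0.7967 accuracy in context, it helps to compare it with a majority-class baseline, which always predicts the most frequent label. A small sketch using only the test-set support counts from the report above:

```python
# Test-set support per class, taken from the classification report above
support = {0: 166, 1: 134}

# A majority-class classifier always predicts the most frequent label
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / sum(support.values())

print(f"Majority label: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Always predicting 0 would score 166/300, or about 0.5533, so the logistic regression model is well above that floor.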



## 4. Transformer-Based Classification {#transformer}

Now let's use BilbyStats to fine-tune a transformer-based model and see whether it improves on the baseline.

In [5]:
# Choose a pre-trained model
model_name = "distilbert-base-uncased"

# Tokenize the datasets for transformer training
train_data_tk, valid_data_tk, test_data_tk = bs.tokenize_data(
    train_data, valid_data, test_data, model_name
)

print("Data tokenized successfully!")
print(f"Training tokens shape: {train_data_tk['input_ids'].shape}")

Map: 100%|██████████| 1400/1400 [00:00<00:00, 18365.69 examples/s]
Map: 100%|██████████| 300/300 [00:00<00:00, 19839.67 examples/s]
Map: 100%|██████████| 300/300 [00:00<00:00, 19407.29 examples/s]

Data tokenized successfully!
Training tokens shape: torch.Size([1400, 512])





In [6]:
# Set up training configuration
savedir = "./model_checkpoints/"  # Change this to your desired directory
savename = "financial_sentiment_classifier"

# Define label mapping
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
num_labels = len(label2id)

# Train the transformer model
print("Starting training...")
bs.tic()  # Start timer

trainer, model, training_args = bs.trainTFmodel(
    train_data_tk, 
    valid_data_tk, 
    model_name, 
    savename=savename, 
    savedir=savedir, 
    num_labels=num_labels, 
    label2id=label2id
)

bs.toc()  # End timer
print("Training completed!")

Starting training...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.308849,0.886667,0.892978,0.881152,0.884401
2,0.397800,0.327547,0.906667,0.908609,0.903632,0.905452
3,0.397800,0.367787,0.916667,0.916787,0.915163,0.915881


Elapsed time: 200.006812 seconds
Training completed!


In [7]:
# Evaluate the transformer model
model_path = f"{savedir}{savename}"

# Make predictions on test set
predictions = bs.predict(test_data, model_path, model_name)

transformer_accuracy = accuracy_score(predictions['true_labels'], predictions['pred_labels'])
print(f"Transformer Model Accuracy: {transformer_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(predictions['true_labels'], predictions['pred_labels']))

Map: 100%|██████████| 300/300 [00:00<00:00, 13633.95 examples/s]


ValueError: Unrecognized model in ./model_checkpoints/financial_sentiment_classifier. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, ... (full list of supported model types omitted)

This cell fails because `./model_checkpoints/financial_sentiment_classifier` does not contain a `config.json` with a `model_type` key, so the fine-tuned model cannot be loaded from that path. Check where the trainer actually wrote its files (for example, a checkpoint subdirectory) and point `model_path` there. Until this cell succeeds, `transformer_accuracy` is undefined, and the later cells that use it will not run.
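
One way to track down a loadable model directory is to search the save directory for `config.json` files. The sketch below uses only the standard library. `find_model_dirs` is a hypothetical helper, and the layout of the files written by `bs.trainTFmodel` is not known from this tutorial.

```python
from pathlib import Path

def find_model_dirs(root):
    """Hypothetical helper: list directories under `root` (including `root`
    itself) that contain a config.json file."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.parent for p in root.rglob("config.json"))

# Prints nothing if the directory is missing or holds no config.json
for candidate in find_model_dirs("./model_checkpoints/"):
    print(candidate)
```

Any directory printed here is a candidate value for `model_path`.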

## 5. LLM-Based Classification {#llm}

BilbyStats also supports classification using large language models. Let's try different approaches.

### 5.1 Single Sample Classification

In [8]:
# Test single sample classification with different models
sample_text = test_texts[0]
true_label = test_labels[0]

print(f"Sample text: {sample_text}")
print(f"True label: {true_label}")

# Try different LLM models (adjust based on your available models)
models_to_test = ["gpt-4o-mini", "claude", "gemini", "llama"]

for model in models_to_test:
    try:
        sentiment = bs.detect_sentiment(sample_text, model_name=model)
        print(f"{model} prediction: {sentiment}")
    except Exception as e:
        print(f"{model} error: {e}")

Sample text: gold falls as dollar strengthens on retail sales
True label: 0
gpt-4o-mini prediction: negative
claude prediction: negative
gemini prediction: negative

llama prediction: negative
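
The LLMs return sentiment strings, while the target column holds 0/1 labels, so the two need a common encoding before they can be scored. A minimal sketch, assuming that `positive` corresponds to "price direction up" and every other answer to "not up". Both the mapping and the helper name are illustrative choices, not part of BilbyStats.

```python
def sentiment_to_up_label(sentiment):
    """Hypothetical mapping: treat 'positive' as price direction up (1)
    and every other answer as not up (0)."""
    return 1 if sentiment.strip().lower() == "positive" else 0

# Predictions printed above for the first test sample (true label 0)
llm_outputs = {"gpt-4o-mini": "negative", "claude": "negative",
               "gemini": "negative", "llama": "negative"}
true_label = 0

matches = {model: sentiment_to_up_label(s) == true_label
           for model, s in llm_outputs.items()}
print(matches)
```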


### 5.2 Batch Classification with Custom Prompts

In [9]:
# Define prompt templates (you can also load these from files if they exist)
simple_prompt = """You are a sentiment analysis expert for financial text.
Classify each input sentence based on sentiment.

Return your answer in the following format:
Label: <one of [negative, neutral, positive]>

Be concise and precise."""

detailed_prompt = """You are a sentiment analysis expert for financial text.
Classify each input sentence based on sentiment.

Return your answer in the following format:
Label: <one of [negative, neutral, positive]>
Explanation: <brief explanation (1–2 sentences) of why this label applies>

Be concise and precise."""

print("Simple prompt:")
print(simple_prompt)
print("\nDetailed prompt:")
print(detailed_prompt)

Simple prompt:
You are a sentiment analysis expert for financial text.
Classify each input sentence based on sentiment.

Return your answer in the following format:
Label: <one of [negative, neutral, positive]>

Be concise and precise.

Detailed prompt:
You are a sentiment analysis expert for financial text.
Classify each input sentence based on sentiment.

Return your answer in the following format:
Label: <one of [negative, neutral, positive]>
Explanation: <brief explanation (1–2 sentences) of why this label applies>

Be concise and precise.
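
Both prompts ask for a fixed `Label:` line, which makes the responses easy to parse. The following is a sketch of such a parser. `parse_llm_response` is a hypothetical helper, and the example response is made up for illustration.

```python
import re

def parse_llm_response(response):
    """Hypothetical parser for the 'Label: ... / Explanation: ...' format."""
    label = re.search(r"^Label:\s*(negative|neutral|positive)\b", response,
                      flags=re.IGNORECASE | re.MULTILINE)
    explanation = re.search(r"^Explanation:\s*(.+)$", response,
                            flags=re.IGNORECASE | re.MULTILINE)
    return {
        "label": label.group(1).lower() if label else None,
        "explanation": explanation.group(1).strip() if explanation else None,
    }

# Made-up response text, for illustration only
example = "Label: Negative\nExplanation: The headline reports a price decline."
parsed = parse_llm_response(example)
print(parsed)
```

Returning `None` for a missing field makes malformed responses easy to count and inspect instead of failing silently.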


In [10]:
# Classify a small batch using api2data function
# First, create a small test dataset
test_sample = df_subset.head(10).copy()

# Save to parquet for api2data function
test_input_path = "./test_sample_input.parquet"
test_output_path = "./test_sample_output.parquet"
test_sample.to_parquet(test_input_path)

# Save the prompt to a file
prompt_path = "./sentiment_classification.txt"
with open(prompt_path, 'w') as f:
    f.write(simple_prompt)

# Run batch classification (adjust model_name based on availability)
try:
    bs.api2data(
        colname='News',
        promptloc=prompt_path,
        dataloc=test_input_path,
        saveloc=test_output_path,
        task='Classify the sentiment of the following news text: ',
        model_name="llama",  # Change to available model
        labels=['Label:'],
        names=['sentiment'],
        lowercase=[True]
    )
    
    # Load and examine results
    llm_results = pd.read_parquet(test_output_path)
    print("LLM Classification Results:")
    print(llm_results[['News', 'Price Direction Up', 'llama_sentiment']].head())
    
except Exception as e:
    print(f"LLM classification error: {e}")
    print("This might be due to model availability or API configuration.")

LLM classification error: api2data() got an unexpected keyword argument 'labels'
This might be due to model availability or API configuration.
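
The error above is a signature mismatch, not a model or API availability problem: the installed `api2data` does not accept a `labels` keyword. Checking a call against a function's signature before running it can catch this early. A sketch using the standard `inspect` module, where `fake_api2data` is a stand-in with a made-up signature and not the real BilbyStats function:

```python
import inspect

def unsupported_kwargs(func, kwargs):
    """Return the keyword names in `kwargs` that `func` does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means any keyword is accepted
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]

# Stand-in function with a made-up signature, NOT the real bs.api2data
def fake_api2data(colname, promptloc, dataloc, saveloc, task, model_name):
    pass

rejected = unsupported_kwargs(fake_api2data, {"colname": "News", "labels": ["Label:"]})
print(rejected)
```

With BilbyStats installed, `print(inspect.signature(bs.api2data))` shows the arguments that the installed version accepts.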


## 6. Model Evaluation and Comparison {#evaluation}

Let's compare the performance of all our models.

In [None]:
# Create a comparison DataFrame
results_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'DistilBERT'],
    'Accuracy': [lr_accuracy, transformer_accuracy],
    'Type': ['Traditional ML', 'Transformer']
})

print("Model Comparison:")
print(results_comparison)

# Plot comparison
plt.figure(figsize=(10, 6))
plt.bar(results_comparison['Model'], results_comparison['Accuracy'])
plt.title('Model Performance Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 6.1 Advanced Evaluation with BilbyStats

In [None]:
# Use BilbyStats' built-in prediction functions for more detailed evaluation
try:
    df_predictions = bs.predict_df(
        df_subset, 
        covariate, 
        model_path, 
        model_name, 
        indices['test'], 
        target
    )
    
    # Calculate metrics
    test_accuracy = accuracy_score(df_predictions['true_labels'], df_predictions['pred_labels'])
    print(f"Test Accuracy (using predict_df): {test_accuracy:.4f}")
    
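    # Analyze prediction confidence
    # Raw logits are unnormalized, so convert them to probabilities with a
    # softmax before reading the top-class probability as a confidence score
    logits = np.asarray(df_predictions['logits'])
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    confidence_scores = probs.max(axis=1)
    print(f"Average confidence: {np.mean(confidence_scores):.4f}")
    print(f"Min confidence: {np.min(confidence_scores):.4f}")
    print(f"Max confidence: {np.max(confidence_scores):.4f}")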
    
except Exception as e:
    print(f"Advanced evaluation error: {e}")

## 7. Advanced Topics {#advanced}

### 7.1 Text Similarity Analysis

In [None]:
# Compare similarity between different texts
text1 = "Gold prices surge amid market uncertainty"
text2 = "Gold futures rise due to economic concerns"
text3 = "Stock market crashes heavily today"

# Calculate cosine similarity using different models
models = ['distilbert-base-uncased', 'roberta-large']

for model in models:
    try:
        sim_1_2 = bs.cos_sim(text1, text2, model)
        sim_1_3 = bs.cos_sim(text1, text3, model)
        print(f"{model}:")
        print(f"  Gold texts similarity: {sim_1_2:.4f}")
        print(f"  Gold vs Stock similarity: {sim_1_3:.4f}")
    except Exception as e:
        print(f"Error with {model}: {e}")
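
Cosine similarity is the dot product of two vectors divided by the product of their norms. Presumably `bs.cos_sim` embeds each text with the named model and then compares the embeddings this way, although the tutorial does not show its internals. A standard-library sketch of the formula on made-up vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings", made up for illustration
gold_a = [1.0, 2.0, 0.0]
gold_b = [2.0, 4.0, 0.0]   # same direction as gold_a
stocks = [0.0, 0.0, 3.0]   # orthogonal to gold_a

print(cosine_similarity(gold_a, gold_b))
print(cosine_similarity(gold_a, stocks))
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, regardless of their lengths.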

### 7.2 Sentiment Analysis with FinBERT

In [None]:
# Use specialized financial sentiment model
financial_texts = [
    "Gold prices declined sharply today",
    "Strong economic indicators boost market confidence",
    "Uncertainty in global markets affects gold demand"
]

for text in financial_texts:
    try:
        score = bs.get_sentiment_score(text)
        print(f"Text: {text}")
        print(f"Sentiment Score: {score:.4f}\n")
    except Exception as e:
        print(f"Error analyzing: {text}, Error: {e}")

### 7.3 Translation and Multilingual Analysis

In [None]:
# Test translation capabilities
chinese_text = "黄金价格今天上涨了"

try:
    english_translation = bs.translate(chinese_text, model_name="llama", languageout="English")
    
    print(f"Chinese: {chinese_text}")
    print(f"English: {english_translation}")
    
    # Analyze sentiment in both languages
    chinese_sentiment = bs.detect_sentiment(chinese_text, model_name="llama")
    english_sentiment = bs.detect_sentiment(english_translation, model_name="llama")
    
    print(f"Chinese sentiment: {chinese_sentiment}")
    print(f"English sentiment: {english_sentiment}")
    
except Exception as e:
    print(f"Translation error: {e}")

### 7.4 Cost Estimation for LLM Usage

In [None]:
# Estimate costs for LLM API calls
sample_input = "Gold prices are volatile today"
sample_output = "negative"
model_name_cost = "gpt-4o"

try:
    costs = bs.model_costs(sample_input, sample_output, model_name_cost, ndocs=1)
    print("Cost Analysis:")
    print(f"Input tokens: {costs['n_input_tokens']}")
    print(f"Output tokens: {costs['m_output_tokens']}")
    print(f"Total cost: ${costs['total_cost']:.6f}")
except Exception as e:
    print(f"Cost calculation error: {e}")
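
API cost generally scales with token counts and per-token prices. The sketch below shows that arithmetic only. `estimate_cost` is a hypothetical helper, and the prices are placeholders, not real rates for any model.

```python
def estimate_cost(n_input_tokens, n_output_tokens,
                  input_price_per_million, output_price_per_million, ndocs=1):
    """Hypothetical helper: token counts times per-million-token prices."""
    per_doc = (n_input_tokens * input_price_per_million
               + n_output_tokens * output_price_per_million) / 1_000_000
    return per_doc * ndocs

# Placeholder prices in USD per million tokens, NOT real rates for any model
cost = estimate_cost(n_input_tokens=200, n_output_tokens=10,
                     input_price_per_million=2.0, output_price_per_million=8.0,
                     ndocs=10_000)
print(f"Estimated cost: ${cost:.2f}")
```

Because the prompt is sent with every document, input tokens usually dominate the cost of short-label classification.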

### 7.5 Data Generation and Augmentation

In [None]:
# Example of data generation using BilbyStats
# Extract random sentences from text data
sample_text = df['News'].iloc[0]
print(f"Original text: {sample_text}")

try:
    # Get sentences from text
    sentences = bs.get_sentences(sample_text, minlen=10, language='English')
    print(f"\nExtracted sentences ({len(sentences)}):")
    for i, sentence in enumerate(sentences[:3]):  # Show first 3
        print(f"{i+1}. {sentence}")
    
    # Get a random sentence
    if sentences:
        random_sentence = bs.get_random_sentence(sample_text, minlen=10)
        print(f"\nRandom sentence: {random_sentence}")
        
except Exception as e:
    print(f"Sentence extraction error: {e}")
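
For intuition, sentence extraction with a minimum length can be approximated with a regular expression. The sketch below is not how `bs.get_sentences` works internally. `split_sentences` is a hypothetical helper, and it assumes `minlen` counts characters.

```python
import re

def split_sentences(text, minlen=10):
    """Hypothetical splitter: break on ., ! or ? followed by whitespace and
    keep pieces with at least `minlen` characters."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if len(p) >= minlen]

# Made-up text; the final piece "Why?" is shorter than minlen and is dropped
sample = "Gold fell. Traders cited a stronger dollar. Why?"
sentences = split_sentences(sample)
print(sentences)
```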

### 7.6 Performance Timing

In [None]:
# Compare inference times for different approaches
sample_texts = test_texts[:5]  # Use first 5 test samples

# Time traditional ML prediction
bs.tic()
X_sample = vectorizer.transform(sample_texts)
lr_preds = lr_model.predict(X_sample)
lr_time = bs.toc()

print(f"Logistic Regression predictions: {lr_preds}")
print(f"Time taken: {lr_time:.4f} seconds")

# Time transformer prediction (if model is available)
try:
    bs.tic()
    # Create small dataset for transformer
    sample_dataset = bs.df2dict(pd.DataFrame({'text': sample_texts}), 'text')
    transformer_preds = bs.predict(sample_dataset, model_path, model_name)
    transformer_time = bs.toc()
    
    print(f"\nTransformer predictions: {transformer_preds['pred_labels']}")
    print(f"Time taken: {transformer_time:.4f} seconds")
    
except Exception as e:
    print(f"\nTransformer timing error: {e}")
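
Outside BilbyStats, the same kind of measurement can be made with `time.perf_counter` from the standard library. A sketch of a small timing context manager, where `timer` is a hypothetical helper:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(results, name):
    """Store the elapsed wall-clock seconds for a block under results[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start

timings = {}
with timer(timings, "sum_of_squares"):
    total = sum(i * i for i in range(100_000))

print(f"sum_of_squares: {timings['sum_of_squares']:.4f} seconds")
```

Single-run timings on five samples are noisy, so repeat each measurement several times before comparing approaches.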

## Summary and Best Practices

1. **Start Simple**: Begin with traditional ML models to establish baselines
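2. **Use Transformers for Better Performance**: Here, DistilBERT reached 0.917 validation accuracy after three epochs, compared with 0.797 test accuracy for the TF-IDF logistic regression baseline (the two figures come from different splits, so treat the gap as indicative)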
3. **Consider LLMs for Complex Tasks**: Use GPT, Claude, or Gemini for tasks requiring reasoning
4. **Monitor Costs**: Be aware of API costs when using commercial LLMs
5. **Validate Results**: Always use proper train/validation/test splits
6. **Choose Appropriate Models**: Consider your specific domain (e.g., FinBERT for financial texts)
7. **Time vs Accuracy Trade-offs**: Traditional ML is fastest, transformers balance speed and accuracy, LLMs are most flexible but slowest

## Next Steps

- Experiment with different transformer architectures
- Try few-shot learning with LLMs
- Implement active learning pipelines
- Explore multi-label classification
- Consider model ensembling techniques
- Use domain-specific models (like FinBERT for financial data)

This tutorial provides a comprehensive introduction to classification with BilbyStats. Adapt the code to your specific use case and data!