diff --git a/gamesense/README.md b/gamesense/README.md
index f5b41ca99..8e3353c3d 100644
--- a/gamesense/README.md
+++ b/gamesense/README.md
@@ -1,27 +1,78 @@
-# 🎮 GameSense: The LLM That Understands Gamers
+# 🎮 GameSense: An LLM That Transforms Gaming Conversations into Structured Data
-Elevate your gaming platform with an AI that translates player language into actionable data. A model that understands gaming terminology, extracts key attributes, and structures conversations for intelligent recommendations and support.
+GameSense is a specialized language model that converts unstructured gaming conversations into structured, actionable data. It listens to how gamers talk and extracts valuable information that can power recommendations, support systems, and analytics.
-## 🚀 Product Overview
+## 🎯 What GameSense Does
-GameSense is a specialized language model designed specifically for gaming platforms and communities. By fine-tuning powerful open-source LLMs on gaming conversations and terminology, GameSense can:
+**Input**: Gamers' natural language about games from forums, chats, reviews, etc.
-- **Understand Gaming Jargon**: Recognize specialized terms across different game genres and communities
-- **Extract Player Sentiment**: Identify frustrations, excitement, and other emotions in player communications
-- **Structure Unstructured Data**: Transform casual player conversations into structured, actionable data
-- **Generate Personalized Responses**: Create contextually appropriate replies that resonate with gamers
-- **Power Intelligent Recommendations**: Suggest games, content, or solutions based on player preferences and history
+**Output**: Structured data with categorized information about games, platforms, preferences, etc.
-Built on ZenML's enterprise-grade MLOps framework, GameSense delivers a production-ready solution that can be deployed, monitored, and continuously improved with minimal engineering overhead.
+Here's a concrete example from our training data:
-## 💡 How It Works
+### Input Example (Gaming Conversation)
+```
+"Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac."
+```
+
+### Output Example (Structured Information)
+```
+inform(
+ name[Dirt: Showdown],
+ release_year[2012],
+ esrb[E 10+ (for Everyone 10 and Older)],
+ genres[driving/racing, sport],
+ platforms[PlayStation, Xbox, PC],
+ available_on_steam[no],
+ has_linux_release[no],
+ has_mac_release[no]
+)
+```
+
+This structured output (see the parsing sketch below) can be used to:
+- Answer specific questions about games ("Is Dirt: Showdown available on Mac?")
+- Track trends in gaming discussions
+- Power recommendation engines
+- Extract user opinions and sentiment
+- Build gaming knowledge graphs
+- Enhance customer support
+
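+A downstream application might consume this output with a small parser. Here is a minimal sketch, assuming the `inform(...)` format shown above (the helper function and regular expressions are illustrative, not part of the GameSense codebase):
+
+```python
+import re
+
+
+def parse_meaning_representation(text: str) -> dict:
+    """Parse a representation like 'inform(name[Dirt: Showdown], ...)' into a dict."""
+    match = re.match(r"^\s*(\w+)\s*\((.*)\)\s*$", text, re.DOTALL)
+    if not match:
+        raise ValueError(f"Unrecognized representation: {text!r}")
+    function_name, body = match.groups()
+    # Each attribute has the form `attribute[value]`
+    attributes = dict(re.findall(r"(\w+)\[([^\]]*)\]", body))
+    return {"function": function_name, "attributes": attributes}
+
+
+record = parse_meaning_representation(
+    "inform(name[Dirt: Showdown], release_year[2012], has_mac_release[no])"
+)
+print(record["attributes"]["has_mac_release"])  # -> "no"
+```
+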
+## 🚀 How GameSense Transforms Gaming Conversations
+
+GameSense listens to gaming chats, forum posts, customer support tickets, social media, and other sources where gamers communicate. As gamers discuss different titles, features, opinions, and issues, GameSense:
+
+1. **Recognizes gaming jargon** across different genres and communities
+2. **Extracts key information** about games, platforms, features, and opinions
+3. **Structures this information** into a standardized format
+4. **Makes it available** for downstream applications
+
+## 💡 Real-World Applications
-GameSense leverages Parameter-Efficient Fine-Tuning (PEFT) techniques to customize powerful foundation models like Microsoft's Phi-2 or Llama 3.1 for gaming-specific applications. The system follows a streamlined pipeline:
+### Community Analysis
+Monitor conversations across Discord, Reddit, and other platforms to track what games are being discussed, what features players care about, and emerging trends.
-1. **Data Preparation**: Gaming conversations are processed and tokenized
-2. **Model Fine-Tuning**: The base model is efficiently customized using LoRA adapters
-3. **Evaluation**: The model is rigorously tested against gaming-specific benchmarks
-4. **Deployment**: High-performing models are automatically promoted to production
+### Intelligent Customer Support
+When a player says "I can't get Dirt: Showdown to run on my Mac," GameSense identifies:
+- The specific game (Dirt: Showdown)
+- The platform in question (Mac)
+- The fact that the game has no Mac release (from structured knowledge)
+
+The support system can then immediately inform the player about the platform incompatibility.
+
+### Smart Recommendations
+When a player has been discussing racing games for PlayStation with family-friendly ratings, GameSense can help power recommendations for similar titles they might enjoy.
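+
+As a rough illustration, a recommendation filter over the extracted attributes can be a simple query; the sketch below assumes records that follow the attribute schema shown earlier (the data and helper are illustrative):
+
+```python
+# Illustrative records using the attribute schema shown earlier
+games = [
+    {"name": "Dirt: Showdown", "genres": ["driving/racing", "sport"],
+     "platforms": ["PlayStation", "Xbox", "PC"], "esrb": "E 10+ (for Everyone 10 and Older)"},
+    {"name": "Hypothetical Shooter", "genres": ["shooter"],
+     "platforms": ["PC"], "esrb": "M (for Mature)"},
+]
+
+
+def recommend(games, genre, platform, family_friendly=True):
+    """Yield names of games matching a genre and platform, optionally E-rated only."""
+    for game in games:
+        if genre in game["genres"] and platform in game["platforms"]:
+            if not family_friendly or game["esrb"].startswith("E"):
+                yield game["name"]
+
+
+print(list(recommend(games, genre="driving/racing", platform="PlayStation")))
+# -> ['Dirt: Showdown']
+```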
+
+### Automated Content Moderation
+By understanding the context of gaming conversations, GameSense can better identify toxic behavior while recognizing harmless gaming slang.
+
+## 🧠 Technical Approach
+
+GameSense uses Parameter-Efficient Fine-Tuning (PEFT) to customize powerful foundation models for understanding gaming language:
+
+1. We start with a base model like Microsoft's Phi-2 or Llama 3.1
+2. Fine-tune on the gem/viggo dataset containing structured gaming conversations
+3. Use LoRA adapters for efficient training
+4. Evaluate on gaming-specific benchmarks
+5. Deploy to production environments
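+
+The model-preparation steps (1-3) condense to roughly the sketch below. The LoRA hyperparameters here are illustrative; the pipeline's actual adapter setup lives in [`utils/loaders.py`](utils/loaders.py) and the config files:
+
+```python
+from peft import LoraConfig, get_peft_model
+from transformers import AutoModelForCausalLM
+
+base_model_id = "microsoft/Phi-3.5-mini-instruct"
+model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
+
+# Attach small trainable LoRA adapters instead of updating every base weight
+lora_config = LoraConfig(
+    r=8,                          # adapter rank (matches the pipeline's default)
+    lora_alpha=16,                # illustrative scaling factor
+    lora_dropout=0.05,            # illustrative dropout
+    target_modules=["qkv_proj"],  # which projections to adapt; model dependent
+    task_type="CAUSAL_LM",
+)
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()  # only a small fraction of weights are trainable
+```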
@@ -46,6 +97,16 @@ GameSense leverages Parameter-Efficient Fine-Tuning (PEFT) techniques to customi
- Python 3.8+
- GPU with at least 24GB VRAM (for full model training)
- ZenML installed and configured
+- Neptune.ai account for experiment tracking (optional)
+
+### Environment Setup
+
+1. Set up your Neptune.ai credentials if you want to use Neptune for experiment tracking:
+ ```bash
+ # Set your Neptune project name and API token as environment variables
+ export NEPTUNE_PROJECT="your-neptune-workspace/your-project-name"
+ export NEPTUNE_API_TOKEN="your-neptune-api-token"
+ ```
### Quick Setup
@@ -95,6 +156,17 @@ python run.py --config configs/llama3-1_finetune_local.yaml
> - For remote finetuning: [`llama3-1_finetune_remote.yaml`](configs/llama3-1_finetune_remote.yaml)
> - For local finetuning: [`llama3-1_finetune_local.yaml`](configs/llama3-1_finetune_local.yaml)
+### Dataset Configuration
+
+By default, GameSense uses the gem/viggo dataset, which contains structured gaming information like:
+
+| gem_id | meaning_representation | target | references |
+|--------|------------------------|--------|------------|
+| viggo-train-0 | inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no]) | Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac. | [Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.] |
+| viggo-train-1 | inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+...]) | Dirt: Showdown is a sport racing game... | [Dirt: Showdown is a sport racing game...] |
+
+You can also train on your own gaming conversations by formatting them in a similar structure and updating the configuration.
+
### Training Acceleration
For faster training on high-end hardware:
@@ -148,7 +220,7 @@ For detailed instructions on data preparation, see our [data customization guide
GameSense includes built-in evaluation using industry-standard metrics:
-- **ROUGE Scores**: Measure response quality and relevance
+- **ROUGE Scores**: Measure how closely generated meaning representations match the reference annotations (see the sketch below)
- **Gaming-Specific Benchmarks**: Evaluate understanding of gaming terminology
- **Automatic Model Promotion**: Only deploy models that meet quality thresholds
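+
+The ROUGE comparison itself is straightforward; here is a minimal sketch using the Hugging Face `evaluate` library (the strings are illustrative, and the evaluation step's exact implementation may differ):
+
+```python
+import evaluate
+
+rouge = evaluate.load("rouge")
+
+predictions = ["inform(name[Dirt: Showdown], release_year[2012], platforms[PlayStation, Xbox, PC])"]
+references = ["inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], platforms[PlayStation, Xbox, PC])"]
+
+scores = rouge.compute(predictions=predictions, references=references)
+print(scores["rouge2"])  # rouge2 is the metric used for promotion decisions
+```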
@@ -192,7 +264,7 @@ GameSense follows a modular architecture for easy customization:
To fine-tune GameSense on your specific gaming platform's data:
-1. **Format your dataset**: Prepare your gaming conversations in a structured format
+1. **Format your dataset**: Prepare your gaming conversations in a structured format similar to gem/viggo
2. **Update the configuration**: Point to your dataset in the config file
3. **Run the pipeline**: GameSense will automatically process and learn from your data
@@ -203,6 +275,55 @@ The [`prepare_data` step](steps/prepare_datasets.py) handles:
For custom data sources, you'll need to prepare the splits in a Hugging Face dataset format. The step returns paths to the stored datasets (`train`, `val`, and `test_raw` splits), with the test set tokenized later during evaluation.
+You can structure conversations from:
+- Game forums
+- Support tickets
+- Discord chats
+- Streaming chats
+- Reviews
+- Social media posts
+
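+For example, here is a minimal sketch of packaging such conversations into the expected splits (the column names follow the gem/viggo layout; the values and output path are illustrative):
+
+```python
+from datasets import Dataset, DatasetDict
+
+rows = [
+    {
+        "gem_id": "custom-train-0",
+        "target": "Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC.",
+        "meaning_representation": "inform(name[Dirt: Showdown], release_year[2012], platforms[PlayStation, Xbox, PC])",
+        "references": ["Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC."],
+    },
+]
+
+splits = DatasetDict({
+    "train": Dataset.from_list(rows),
+    "validation": Dataset.from_list(rows),
+    "test": Dataset.from_list(rows),
+})
+# Save locally (or push to the Hugging Face Hub) and point the pipeline config at it
+splits.save_to_disk("my_gaming_dataset")
+```
+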
## 📚 Documentation
For learning more about how to use ZenML to build your own MLOps pipelines, refer to our comprehensive [ZenML documentation](https://docs.zenml.io/).
+
+## Running on CPU-only Environment
+
+If you don't have access to a GPU, you can still run this project with the CPU-only configuration. We've made several optimizations to make this project work on CPU, including:
+
+- Smaller batch sizes for reduced memory footprint
+- Fewer training steps
+- Disabled GPU-specific features (quantization, bf16, etc.)
+- Using smaller test datasets for evaluation
+- Special handling for Phi-3.5 model caching issues on CPU
+
+To run the project on CPU:
+
+```bash
+python run.py --config phi3.5_finetune_cpu.yaml
+```
+
+Note that training on CPU will be significantly slower than training on a GPU. The CPU configuration uses:
+
+1. A smaller model (`Phi-3.5-mini-instruct`) which is more CPU-friendly
+2. A smaller batch size and fewer gradient accumulation steps
+3. Far fewer total training steps (25 instead of 300)
+4. Half-precision (float16) where possible to reduce memory usage
+5. Smaller dataset subsets (50 training samples, 10 validation samples, 5 test samples)
+6. Special compatibility settings for Phi models running on CPU
+
+For best results, we recommend:
+- Using a machine with at least 16GB of RAM
+- Being patient! LLM training on CPU is much slower than on GPU
+- If you still encounter memory issues, try reducing the `max_train_samples` parameter even further in the config file
+
+### Known Issues and Workarounds
+
+Some large language models like Phi-3.5 have caching mechanisms that are optimized for GPU usage and may encounter issues when running on CPU. Our CPU configuration includes several workarounds:
+
+1. Disabling KV caching for model generation
+2. Using the `torch.float16` data type to reduce memory usage
+3. Disabling flash attention which isn't needed on CPU
+4. Using standard AdamW optimizer instead of 8-bit optimizers that require GPU
+
+These changes allow the model to run on CPU with less memory and avoid compatibility issues, although at the cost of some performance.
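+
+In isolation, the model-loading side of these workarounds boils down to roughly the sketch below, mirroring what [`utils/loaders.py`](utils/loaders.py) does (exact arguments may differ):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained(
+    "microsoft/Phi-3.5-mini-instruct",
+    device_map={"": "cpu"},         # keep every layer on the CPU
+    torch_dtype=torch.float16,      # half precision to reduce memory
+    attn_implementation="eager",    # no flash attention on CPU
+    low_cpu_mem_usage=True,
+    trust_remote_code=True,
+)
+model.config.use_cache = False      # work around Phi KV-cache issues on CPU
+```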
diff --git a/gamesense/configs/phi3.5_finetune_cpu.yaml b/gamesense/configs/phi3.5_finetune_cpu.yaml
new file mode 100644
index 000000000..0c243ec81
--- /dev/null
+++ b/gamesense/configs/phi3.5_finetune_cpu.yaml
@@ -0,0 +1,85 @@
+# Apache Software License 2.0
+#
+# Copyright (c) ZenML GmbH 2024. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+model:
+ name: llm-peft-phi-3.5-mini-instruct-cpu
+ description: "Fine-tune Phi-3.5-mini-instruct on CPU."
+ tags:
+ - llm
+ - peft
+ - phi-3.5
+ - cpu
+  version: 25_steps
+
+settings:
+ docker:
+ parent_image: pytorch/pytorch:2.2.2-runtime
+ requirements: requirements.txt
+ python_package_installer: uv
+ python_package_installer_args:
+ system: null
+ apt_packages:
+ - git
+ environment:
+ MKL_SERVICE_FORCE_INTEL: "1"
+ # Explicitly disable MPS
+ PYTORCH_ENABLE_MPS_FALLBACK: "0"
+ PYTORCH_MPS_HIGH_WATERMARK_RATIO: "0.0"
+
+parameters:
+ # Uses a smaller model for CPU training
+ base_model_id: microsoft/Phi-3.5-mini-instruct
+ use_fast: False
+ load_in_4bit: False
+ load_in_8bit: False
+ cpu_only: True # Enable CPU-only mode
+ # Extra conservative dataset size for CPU
+ max_train_samples: 50
+ max_val_samples: 10
+ max_test_samples: 5
+ system_prompt: |
+ Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
+ This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
+ The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']
+
+
+steps:
+ prepare_data:
+ parameters:
+ dataset_name: gem/viggo
+      # Sample-count limits (max_train_samples, max_val_samples, max_test_samples)
+      # are set once at the pipeline level in the `parameters` section above
+
+ finetune:
+ parameters:
+ max_steps: 25 # Further reduced steps for CPU training
+ eval_steps: 5 # More frequent evaluation
+ bf16: False # Disable bf16 for CPU compatibility
+ per_device_train_batch_size: 1 # Smallest batch size for CPU
+ gradient_accumulation_steps: 2 # Reduced for CPU
+ optimizer: "adamw_torch" # Use standard AdamW rather than 8-bit for CPU
+ logging_steps: 2 # More frequent logging
+ save_steps: 25 # Save less frequently
+ save_total_limit: 1 # Keep only the best model
+ evaluation_strategy: "steps"
+
+ promote:
+ parameters:
+ metric: rouge2
+ target_stage: staging
\ No newline at end of file
diff --git a/gamesense/pipelines/train.py b/gamesense/pipelines/train.py
index c91a76381..71a042bf1 100644
--- a/gamesense/pipelines/train.py
+++ b/gamesense/pipelines/train.py
@@ -33,6 +33,10 @@ def llm_peft_full_finetune(
use_fast: bool = True,
load_in_8bit: bool = False,
load_in_4bit: bool = False,
+ cpu_only: bool = False,
+ max_train_samples: int = None,
+ max_val_samples: int = None,
+ max_test_samples: int = None,
):
"""Pipeline for finetuning an LLM with peft.
@@ -42,20 +46,39 @@ def llm_peft_full_finetune(
- finetune: finetune the model
- evaluate_model: evaluate the base and finetuned model
- promote: promote the model to the target stage, if evaluation was successful
+
+ Args:
+ system_prompt: The system prompt to use.
+ base_model_id: The base model id to use.
+ use_fast: Whether to use the fast tokenizer.
+ load_in_8bit: Whether to load in 8-bit precision (requires GPU).
+ load_in_4bit: Whether to load in 4-bit precision (requires GPU).
+ cpu_only: Whether to force using CPU only and disable quantization.
+ max_train_samples: Maximum number of training samples to use (for CPU or testing).
+ max_val_samples: Maximum number of validation samples to use (for CPU or testing).
+ max_test_samples: Maximum number of test samples to use (for CPU or testing).
"""
- if not load_in_8bit and not load_in_4bit:
- raise ValueError(
- "At least one of `load_in_8bit` and `load_in_4bit` must be True."
- )
- if load_in_4bit and load_in_8bit:
- raise ValueError(
- "Only one of `load_in_8bit` and `load_in_4bit` can be True."
- )
+ if not cpu_only:
+ if not load_in_8bit and not load_in_4bit:
+ raise ValueError(
+ "At least one of `load_in_8bit` and `load_in_4bit` must be True when not in CPU-only mode."
+ )
+ if load_in_4bit and load_in_8bit:
+ raise ValueError(
+ "Only one of `load_in_8bit` and `load_in_4bit` can be True."
+ )
+
+ if cpu_only:
+ load_in_8bit = False
+ load_in_4bit = False
datasets_dir = prepare_data(
base_model_id=base_model_id,
system_prompt=system_prompt,
use_fast=use_fast,
+ max_train_samples=max_train_samples,
+ max_val_samples=max_val_samples,
+ max_test_samples=max_test_samples,
)
evaluate_model(
@@ -66,6 +89,7 @@ def llm_peft_full_finetune(
use_fast=use_fast,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
+ cpu_only=cpu_only,
id="evaluate_base",
)
log_metadata_from_step_artifact(
@@ -82,6 +106,8 @@ def llm_peft_full_finetune(
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
use_accelerate=False,
+ cpu_only=cpu_only,
+ bf16=not cpu_only,
)
evaluate_model(
@@ -92,6 +118,7 @@ def llm_peft_full_finetune(
use_fast=use_fast,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
+ cpu_only=cpu_only,
id="evaluate_finetuned",
)
log_metadata_from_step_artifact(
diff --git a/gamesense/pipelines/train_accelerated.py b/gamesense/pipelines/train_accelerated.py
index de05601ea..7c74e3b54 100644
--- a/gamesense/pipelines/train_accelerated.py
+++ b/gamesense/pipelines/train_accelerated.py
@@ -34,6 +34,9 @@ def llm_peft_full_finetune(
use_fast: bool = True,
load_in_8bit: bool = False,
load_in_4bit: bool = False,
+ max_train_samples: int = None,
+ max_val_samples: int = None,
+ max_test_samples: int = None,
):
"""Pipeline for finetuning an LLM with peft.
@@ -43,6 +46,16 @@ def llm_peft_full_finetune(
- finetune: finetune the model
- evaluate_model: evaluate the base and finetuned model
- promote: promote the model to the target stage, if evaluation was successful
+
+ Args:
+ system_prompt: The system prompt to use.
+ base_model_id: The base model id to use.
+ use_fast: Whether to use the fast tokenizer.
+ load_in_8bit: Whether to load in 8-bit precision (requires GPU).
+ load_in_4bit: Whether to load in 4-bit precision (requires GPU).
+ max_train_samples: Maximum number of training samples to use (for CPU or testing).
+ max_val_samples: Maximum number of validation samples to use (for CPU or testing).
+ max_test_samples: Maximum number of test samples to use (for CPU or testing).
"""
if not load_in_8bit and not load_in_4bit:
raise ValueError(
@@ -57,6 +70,9 @@ def llm_peft_full_finetune(
base_model_id=base_model_id,
system_prompt=system_prompt,
use_fast=use_fast,
+ max_train_samples=max_train_samples,
+ max_val_samples=max_val_samples,
+ max_test_samples=max_test_samples,
)
evaluate_model(
diff --git a/gamesense/run.py b/gamesense/run.py
index 8b56d7073..3d97a0f25 100644
--- a/gamesense/run.py
+++ b/gamesense/run.py
@@ -76,7 +76,19 @@ def main(
if not config:
raise RuntimeError("Config file is required to run a pipeline.")
- pipeline_args["config_path"] = os.path.join(config_folder, config)
+ config_path = os.path.join(config_folder, config)
+ pipeline_args["config_path"] = config_path
+
+ # Display a message if using CPU configuration
+ if "cpu" in config:
+ print("\n" + "="*80)
+ print("RUNNING IN CPU-ONLY MODE")
+ print("This will use a CPU-optimized configuration with:")
+ print("- Smaller batch sizes")
+ print("- Fewer training steps")
+ print("- Disabled GPU-specific features (quantization, bf16, etc)")
+ print("Note: Training will be much slower but should require less memory")
+ print("="*80 + "\n")
if accelerate:
from pipelines.train_accelerated import llm_peft_full_finetune
diff --git a/gamesense/steps/evaluate_model.py b/gamesense/steps/evaluate_model.py
index 1c7c82067..39ee1f0cf 100644
--- a/gamesense/steps/evaluate_model.py
+++ b/gamesense/steps/evaluate_model.py
@@ -45,6 +45,7 @@ def evaluate_model(
use_fast: bool = True,
load_in_4bit: bool = False,
load_in_8bit: bool = False,
+ cpu_only: bool = False,
) -> None:
"""Evaluate the model with ROUGE metrics.
@@ -57,7 +58,13 @@ def evaluate_model(
use_fast: Whether to use the fast tokenizer.
load_in_4bit: Whether to load the model in 4bit mode.
load_in_8bit: Whether to load the model in 8bit mode.
+ cpu_only: Whether to force using CPU only and disable quantization.
"""
+ # Force disable GPU optimizations if in CPU-only mode
+ if cpu_only:
+ load_in_4bit = False
+ load_in_8bit = False
+
cleanup_gpu_memory(force=True)
# authenticate with Hugging Face for gated repos
@@ -79,7 +86,14 @@ def evaluate_model(
use_fast=use_fast,
)
test_dataset = load_from_disk(str((datasets_dir / "test_raw").absolute()))
- test_dataset = test_dataset[:50]
+
+ # Reduce dataset size for CPU evaluation to make it more manageable
+ if cpu_only:
+ logger.info("CPU-only mode: Using a smaller test dataset subset")
+ test_dataset = test_dataset[:10] # Use only 10 samples for CPU
+ else:
+ test_dataset = test_dataset[:50] # Use 50 samples for GPU
+
ground_truths = test_dataset["meaning_representation"]
tokenized_train_dataset = tokenize_for_eval(
test_dataset, tokenizer, system_prompt
@@ -92,6 +106,7 @@ def evaluate_model(
is_training=False,
load_in_4bit=load_in_4bit,
load_in_8bit=load_in_8bit,
+ cpu_only=cpu_only,
)
else:
logger.info("Generating using finetuned model...")
@@ -99,16 +114,106 @@ def evaluate_model(
ft_model_dir,
load_in_4bit=load_in_4bit,
load_in_8bit=load_in_8bit,
+ cpu_only=cpu_only,
)
model.eval()
+
+ # Adjust generation parameters for CPU
+ max_new_tokens = 30 if cpu_only else 100
+
+ # Preemptively disable use_cache for Phi models on CPU to avoid 'get_max_length' error
+ is_phi_model = "phi" in base_model_id.lower()
+ use_cache = not (is_phi_model and cpu_only)
+
+ if not use_cache:
+ logger.info("Preemptively disabling KV cache for Phi model on CPU")
+ if hasattr(model.config, "use_cache"):
+ model.config.use_cache = False
+
with torch.no_grad():
- predictions = model.generate(
- input_ids=tokenized_train_dataset["input_ids"],
- attention_mask=tokenized_train_dataset["attention_mask"],
- max_new_tokens=100,
- pad_token_id=2,
- )
+ try:
+ # Move inputs to the same device as the model
+ device = next(model.parameters()).device
+ input_ids = tokenized_train_dataset["input_ids"].to(device)
+ attention_mask = tokenized_train_dataset["attention_mask"].to(device)
+
+ # Generate with appropriate parameters
+ logger.info(f"Generating with use_cache={use_cache}")
+ predictions = model.generate(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ max_new_tokens=max_new_tokens,
+ pad_token_id=2,
+ use_cache=use_cache, # Use the preemptively determined setting
+ do_sample=False # Use greedy decoding for more stable results on CPU
+ )
+ except (AttributeError, RuntimeError) as e:
+ logger.warning(f"Initial generation attempt failed with error: {str(e)}")
+
+ # First fallback: try with more safety settings
+ if "get_max_length" in str(e) or "DynamicCache" in str(e) or cpu_only:
+ logger.warning("Using fallback generation strategy with minimal parameters")
+ try:
+ # Force model to CPU if needed
+ if not str(next(model.parameters()).device) == "cpu":
+ logger.info("Moving model to CPU for generation")
+ model = model.to("cpu")
+
+ # Move inputs to CPU
+ input_ids = tokenized_train_dataset["input_ids"].to("cpu")
+ attention_mask = tokenized_train_dataset["attention_mask"].to("cpu")
+
+ predictions = model.generate(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ max_new_tokens=20, # Even smaller for safety
+ pad_token_id=2,
+ use_cache=False, # Disable KV caching completely
+ do_sample=False, # Use greedy decoding
+ num_beams=1 # Simple beam search
+ )
+                except Exception as e2:
+ logger.warning(f"Second generation attempt failed with error: {str(e2)}")
+
+ # Final fallback: process one sample at a time
+ logger.warning("Final fallback: processing one sample at a time")
+
+ # Process one sample at a time
+ all_predictions = []
+ batch_size = tokenized_train_dataset["input_ids"].shape[0]
+
+ for i in range(batch_size):
+ try:
+ # Process one sample at a time
+ single_input = tokenized_train_dataset["input_ids"][i:i+1].to("cpu")
+ single_attention = tokenized_train_dataset["attention_mask"][i:i+1].to("cpu")
+
+ single_pred = model.generate(
+ input_ids=single_input,
+ attention_mask=single_attention,
+ max_new_tokens=20, # Even further reduced for safety
+ num_beams=1,
+ do_sample=False,
+ use_cache=False,
+ pad_token_id=2,
+ )
+ all_predictions.append(single_pred)
+ except Exception as sample_error:
+ logger.error(f"Failed to generate for sample {i}: {str(sample_error)}")
+ # Create an empty prediction as placeholder
+ all_predictions.append(tokenized_train_dataset["input_ids"][i:i+1])
+
+ # Combine the individual predictions
+ if all_predictions:
+ predictions = torch.cat(all_predictions, dim=0)
+ else:
+ # If all samples failed, return original inputs
+ logger.error("All samples failed in generation. Using inputs as fallback.")
+ predictions = tokenized_train_dataset["input_ids"]
+ else:
+ # Re-raise if not a cache-related issue
+ raise e
predictions = tokenizer.batch_decode(
predictions[:, tokenized_train_dataset["input_ids"].shape[1] :],
skip_special_tokens=True,
diff --git a/gamesense/steps/finetune.py b/gamesense/steps/finetune.py
index 5421757d7..cea0804ee 100644
--- a/gamesense/steps/finetune.py
+++ b/gamesense/steps/finetune.py
@@ -50,11 +50,14 @@ def finetune(
per_device_train_batch_size: int = 2,
gradient_accumulation_steps: int = 4,
warmup_steps: int = 5,
- bf16: bool = True,
+ bf16: bool = False, # Changed to default False for CPU compatibility
use_accelerate: bool = False,
use_fast: bool = True,
load_in_4bit: bool = False,
load_in_8bit: bool = False,
+ cpu_only: bool = False,
+ save_total_limit: int = 1,
+ evaluation_strategy: str = "steps",
) -> Annotated[
Path, ArtifactConfig(name="ft_model_dir", artifact_type=ArtifactType.MODEL)
]:
@@ -82,10 +85,19 @@ def finetune(
use_fast: Whether to use the fast tokenizer.
load_in_4bit: Whether to load the model in 4bit mode.
load_in_8bit: Whether to load the model in 8bit mode.
+ cpu_only: Whether to force using CPU only and disable quantization.
+ save_total_limit: The total number of checkpoints to keep (None means keep all).
+ evaluation_strategy: The evaluation strategy to use (steps, epoch, or no).
Returns:
The path to the finetuned model directory.
"""
+ # Force disable GPU optimizations if in CPU-only mode
+ if cpu_only:
+ load_in_4bit = False
+ load_in_8bit = False
+ bf16 = False
+
cleanup_gpu_memory(force=True)
# authenticate with Hugging Face for gated repos
@@ -131,6 +143,7 @@ def finetune(
should_print=should_print,
load_in_4bit=load_in_4bit,
load_in_8bit=load_in_8bit,
+ cpu_only=cpu_only, # Pass the CPU-only flag to the model loader
)
trainer = transformers.Trainer(
@@ -160,11 +173,12 @@ def finetune(
save_steps=min(save_steps, max_steps)
if max_steps >= 0
else save_steps,
- evaluation_strategy="steps",
+ evaluation_strategy=evaluation_strategy,
eval_steps=eval_steps,
do_eval=True,
label_names=["input_ids"],
ddp_find_unused_parameters=False,
+ save_total_limit=save_total_limit,
),
data_collator=transformers.DataCollatorForLanguageModeling(
tokenizer, mlm=False
diff --git a/gamesense/steps/log_metadata.py b/gamesense/steps/log_metadata.py
index 14371b78b..d0dc4729f 100644
--- a/gamesense/steps/log_metadata.py
+++ b/gamesense/steps/log_metadata.py
@@ -17,7 +17,7 @@
from typing import Any, Dict
-from zenml import get_step_context, log_model_metadata, step
+from zenml import get_step_context, log_metadata, step
@step(enable_cache=False)
@@ -34,9 +34,11 @@ def log_metadata_from_step_artifact(
context = get_step_context()
metadata_dict: Dict[str, Any] = (
         context.pipeline_run.steps[step_name].outputs[artifact_name].load()
)
- metadata = {artifact_name: metadata_dict}
-
- log_model_metadata(metadata)
+    log_metadata(
+        metadata={artifact_name: metadata_dict},
+        infer_model=True,
+    )
diff --git a/gamesense/steps/prepare_datasets.py b/gamesense/steps/prepare_datasets.py
index 3e58b00e1..00711191e 100644
--- a/gamesense/steps/prepare_datasets.py
+++ b/gamesense/steps/prepare_datasets.py
@@ -32,6 +32,9 @@ def prepare_data(
system_prompt: str,
dataset_name: str = "gem/viggo",
use_fast: bool = True,
+ max_train_samples: int = None,
+ max_val_samples: int = None,
+ max_test_samples: int = None,
) -> Annotated[Path, "datasets_dir"]:
"""Prepare the datasets for finetuning.
@@ -40,18 +43,31 @@ def prepare_data(
system_prompt: The system prompt to use.
dataset_name: The name of the dataset to use.
use_fast: Whether to use the fast tokenizer.
+ max_train_samples: Maximum number of training samples to use (for CPU or testing).
+ max_val_samples: Maximum number of validation samples to use (for CPU or testing).
+ max_test_samples: Maximum number of test samples to use (for CPU or testing).
Returns:
The path to the datasets directory.
"""
from datasets import load_dataset
+ import logging
+ logger = logging.getLogger(__name__)
cleanup_gpu_memory(force=True)
+ # Set default values if None (to prevent validation errors)
+ max_train_samples = max_train_samples if max_train_samples is not None else 0
+ max_val_samples = max_val_samples if max_val_samples is not None else 0
+ max_test_samples = max_test_samples if max_test_samples is not None else 0
+
log_model_metadata(
{
"system_prompt": system_prompt,
"base_model_id": base_model_id,
+ "max_train_samples": max_train_samples,
+ "max_val_samples": max_val_samples,
+ "max_test_samples": max_test_samples,
}
)
@@ -62,23 +78,39 @@ def prepare_data(
system_prompt=system_prompt,
)
+ # Load and potentially limit the training dataset
train_dataset = load_dataset(
dataset_name,
split="train",
trust_remote_code=True,
)
+ if max_train_samples > 0 and max_train_samples < len(train_dataset):
+ logger.info(f"Limiting training dataset to {max_train_samples} samples (from {len(train_dataset)})")
+ train_dataset = train_dataset.select(range(max_train_samples))
+
tokenized_train_dataset = train_dataset.map(gen_and_tokenize)
+
+ # Load and potentially limit the validation dataset
eval_dataset = load_dataset(
dataset_name,
split="validation",
trust_remote_code=True,
)
+ if max_val_samples > 0 and max_val_samples < len(eval_dataset):
+ logger.info(f"Limiting validation dataset to {max_val_samples} samples (from {len(eval_dataset)})")
+ eval_dataset = eval_dataset.select(range(max_val_samples))
+
tokenized_val_dataset = eval_dataset.map(gen_and_tokenize)
+
+ # Load and potentially limit the test dataset
test_dataset = load_dataset(
dataset_name,
split="test",
trust_remote_code=True,
)
+ if max_test_samples > 0 and max_test_samples < len(test_dataset):
+ logger.info(f"Limiting test dataset to {max_test_samples} samples (from {len(test_dataset)})")
+ test_dataset = test_dataset.select(range(max_test_samples))
datasets_path = Path("datasets")
tokenized_train_dataset.save_to_disk(
diff --git a/gamesense/utils/loaders.py b/gamesense/utils/loaders.py
index 5ddeeae56..919c269bc 100644
--- a/gamesense/utils/loaders.py
+++ b/gamesense/utils/loaders.py
@@ -33,6 +33,7 @@ def load_base_model(
should_print: bool = True,
load_in_8bit: bool = False,
load_in_4bit: bool = False,
+ cpu_only: bool = False,
) -> Union[Any, Tuple[Any, Dataset, Dataset]]:
"""Load the base model.
@@ -45,37 +46,102 @@ def load_base_model(
should_print: Whether to print the trainable parameters.
load_in_8bit: Whether to load the model in 8-bit mode.
load_in_4bit: Whether to load the model in 4-bit mode.
+ cpu_only: Whether to force using CPU only and disable quantization.
Returns:
The base model.
"""
from accelerate import Accelerator
from transformers import BitsAndBytesConfig
+ import logging
+ logger = logging.getLogger(__name__)
+
+ # Explicitly disable MPS when in CPU-only mode
+ if cpu_only:
+ import os
+ os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
+ os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "0"
+        # Force PyTorch to not use MPS (private API, guarded by hasattr)
+        if hasattr(torch._C, "_set_mps_enabled"):
+            torch._C._set_mps_enabled(False)
+ # Set default device to CPU explicitly
+ torch.set_default_device("cpu")
+ logger.warning("Disabled MPS device for CPU-only mode.")
if use_accelerate:
accelerator = Accelerator()
device_map = {"": accelerator.process_index}
else:
- device_map = {"": torch.cuda.current_device()}
-
- bnb_config = BitsAndBytesConfig(
- load_in_8bit=load_in_8bit,
- load_in_4bit=load_in_4bit,
- bnb_4bit_use_double_quant=True,
- bnb_4bit_quant_type="nf4",
- bnb_4bit_compute_dtype=torch.bfloat16,
- )
+ # Check for available devices and use the best one
+ if cpu_only:
+ device_map = {"": "cpu"}
+ elif torch.cuda.is_available():
+ device_map = {"": torch.cuda.current_device()}
+ elif torch.backends.mps.is_available() and not cpu_only:
+ device_map = {"": "mps"}
+ else:
+ device_map = {"": "cpu"}
+
+ # Only use BitsAndBytes config if CUDA is available and quantization is requested
+ # and we're not in CPU-only mode
+ if (load_in_8bit or load_in_4bit) and torch.cuda.is_available() and not cpu_only:
+ bnb_config = BitsAndBytesConfig(
+ load_in_8bit=load_in_8bit,
+ load_in_4bit=load_in_4bit,
+ bnb_4bit_use_double_quant=True,
+ bnb_4bit_quant_type="nf4",
+ bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ else:
+ bnb_config = None
+ # Reset these flags if CUDA is not available or in CPU-only mode
+ load_in_8bit = False
+ load_in_4bit = False
+
+ # Print device information for debugging
+ if should_print:
+ print(f"Loading model on device: {device_map}")
+
+ # Use half precision for CPU to reduce memory usage if not in training
+ torch_dtype = torch.float16 if device_map[""] == "cpu" and not is_training else None
+
+ # Check if it's a Phi model
+ is_phi_model = "phi" in base_model_id.lower()
+
+ model_kwargs = {
+ "quantization_config": bnb_config,
+ "device_map": device_map,
+ "trust_remote_code": True,
+ "torch_dtype": torch_dtype,
+ # Use low_cpu_mem_usage for CPU training to minimize memory usage
+ "low_cpu_mem_usage": device_map[""] == "cpu",
+ }
+
+ # Add special config for Phi models on CPU to avoid cache issues
+ if is_phi_model and (cpu_only or device_map[""] == "cpu"):
+ if should_print:
+ print("Loading Phi model on CPU with special configuration to avoid caching issues")
+ model_kwargs["use_flash_attention_2"] = False
+ # Set attn_implementation to eager for Phi models on CPU
+ model_kwargs["attn_implementation"] = "eager"
model = AutoModelForCausalLM.from_pretrained(
base_model_id,
- quantization_config=bnb_config,
- device_map=device_map,
- trust_remote_code=True,
+ **model_kwargs
)
+ # For Phi models on CPU, disable kv cache feature to avoid errors
+ if is_phi_model and (cpu_only or device_map[""] == "cpu"):
+ if hasattr(model.config, "use_cache"):
+ model.config.use_cache = False
+ if should_print:
+ print("Disabled KV cache for Phi model on CPU to avoid errors")
+
if is_training:
model.gradient_checkpointing_enable()
- model = prepare_model_for_kbit_training(model)
+
+ # For CPU-only mode, skip prepare_model_for_kbit_training if not using quantization
+ if not (cpu_only and not (load_in_8bit or load_in_4bit)):
+ model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=8,
@@ -108,6 +174,7 @@ def load_pretrained_model(
ft_model_dir: Path,
load_in_4bit: bool = False,
load_in_8bit: bool = False,
+ cpu_only: bool = False,
) -> AutoModelForCausalLM:
"""Load the finetuned model saved in the output directory.
@@ -115,23 +182,76 @@ def load_pretrained_model(
ft_model_dir: The path to the finetuned model directory.
load_in_4bit: Whether to load the model in 4-bit mode.
load_in_8bit: Whether to load the model in 8-bit mode.
+ cpu_only: Whether to force using CPU only and disable quantization.
Returns:
The finetuned model.
"""
from transformers import BitsAndBytesConfig
+ import logging
+ logger = logging.getLogger(__name__)
- bnb_config = BitsAndBytesConfig(
- load_in_8bit=load_in_8bit,
- load_in_4bit=load_in_4bit,
- bnb_4bit_use_double_quant=True,
- bnb_4bit_quant_type="nf4",
- bnb_4bit_compute_dtype=torch.bfloat16,
- )
+ # Explicitly disable MPS when in CPU-only mode
+ if cpu_only:
+ import os
+ os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
+ os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "0"
+        # Force PyTorch to not use MPS (private API, guarded by hasattr)
+        if hasattr(torch._C, "_set_mps_enabled"):
+            torch._C._set_mps_enabled(False)
+ # Set default device to CPU explicitly
+ torch.set_default_device("cpu")
+ logger.warning("Disabled MPS device for CPU-only mode.")
+
+ # Set device map based on available hardware and settings
+ if cpu_only:
+ device_map = "cpu"
+ else:
+ device_map = "auto"
+
+ # Only use BitsAndBytes config if quantization is requested and we're not in CPU-only mode
+ if (load_in_8bit or load_in_4bit) and not cpu_only and torch.cuda.is_available():
+ bnb_config = BitsAndBytesConfig(
+ load_in_8bit=load_in_8bit,
+ load_in_4bit=load_in_4bit,
+ bnb_4bit_use_double_quant=True,
+ bnb_4bit_quant_type="nf4",
+ bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ else:
+ bnb_config = None
+
+ # Use half precision for CPU to reduce memory usage
+ torch_dtype = torch.float16 if device_map == "cpu" else None
+
+ # Special config for Phi models on CPU to avoid cache issues
+ # Check if it's a Phi model
+ is_phi_model = "phi" in str(ft_model_dir).lower()
+
+ model_kwargs = {
+ "quantization_config": bnb_config,
+ "device_map": device_map,
+ "trust_remote_code": True,
+ "torch_dtype": torch_dtype,
+ # Use low_cpu_mem_usage for CPU to minimize memory usage
+ "low_cpu_mem_usage": device_map == "cpu",
+ }
+
+ # Add special config for Phi models on CPU to avoid cache issues
+ if is_phi_model and (cpu_only or device_map == "cpu"):
+ logger.warning("Loading Phi model on CPU with special configuration to avoid caching issues")
+ model_kwargs["use_flash_attention_2"] = False
+ # Set attn_implementation to eager for Phi models on CPU
+ model_kwargs["attn_implementation"] = "eager"
+
model = AutoModelForCausalLM.from_pretrained(
ft_model_dir,
- quantization_config=bnb_config,
- device_map="auto",
- trust_remote_code=True,
+ **model_kwargs
)
+
+ # For Phi models on CPU, disable kv cache feature to avoid errors
+ if is_phi_model and (cpu_only or device_map == "cpu"):
+ if hasattr(model.config, "use_cache"):
+ model.config.use_cache = False
+ logger.warning("Disabled KV cache for Phi model on CPU to avoid errors")
+
return model
diff --git a/gamesense/utils/tokenizer.py b/gamesense/utils/tokenizer.py
index 6e92dfe34..66a55d785 100644
--- a/gamesense/utils/tokenizer.py
+++ b/gamesense/utils/tokenizer.py
@@ -17,6 +17,7 @@
from transformers import AutoTokenizer
+import torch
def load_tokenizer(
@@ -113,9 +114,7 @@ def tokenize_for_eval(
tokenizer: AutoTokenizer,
system_prompt: str,
):
- """Tokenizes the prompts for evaluation.
-
- This runs for the whole test dataset at once.
+ """Tokenize the data for evaluation.
Args:
data_points: The data points to tokenize.
@@ -123,11 +122,10 @@ def tokenize_for_eval(
system_prompt: The system prompt to use.
Returns:
- The tokenized prompt.
+ The tokenized data.
"""
eval_prompts = [
- f"""{system_prompt}
-
+ f"""
### Target sentence:
{data_point}
@@ -135,6 +133,8 @@ def tokenize_for_eval(
"""
for data_point in data_points["target"]
]
+    # Use the best available device instead of hardcoding "cuda"
+    if torch.cuda.is_available():
+        device = "cuda"
+    elif torch.backends.mps.is_available():
+        device = "mps"
+    else:
+        device = "cpu"
return tokenizer(eval_prompts, padding="longest", return_tensors="pt").to(
- "cuda"
+ device
)