diff --git a/gamesense/README.md b/gamesense/README.md index f5b41ca99..8e3353c3d 100644 --- a/gamesense/README.md +++ b/gamesense/README.md @@ -1,27 +1,78 @@ -# 🎮 GameSense: The LLM That Understands Gamers +# 🎮 GameSense: An LLM That Transforms Gaming Conversations into Structured Data -Elevate your gaming platform with an AI that translates player language into actionable data. A model that understands gaming terminology, extracts key attributes, and structures conversations for intelligent recommendations and support. +GameSense is a specialized language model that converts unstructured gaming conversations into structured, actionable data. It listens to how gamers talk and extracts valuable information that can power recommendations, support systems, and analytics. -## 🚀 Product Overview +## 🎯 What GameSense Does -GameSense is a specialized language model designed specifically for gaming platforms and communities. By fine-tuning powerful open-source LLMs on gaming conversations and terminology, GameSense can: +**Input**: Gamers' natural language about games from forums, chats, reviews, etc. -- **Understand Gaming Jargon**: Recognize specialized terms across different game genres and communities -- **Extract Player Sentiment**: Identify frustrations, excitement, and other emotions in player communications -- **Structure Unstructured Data**: Transform casual player conversations into structured, actionable data -- **Generate Personalized Responses**: Create contextually appropriate replies that resonate with gamers -- **Power Intelligent Recommendations**: Suggest games, content, or solutions based on player preferences and history +**Output**: Structured data with categorized information about games, platforms, preferences, etc. -Built on ZenML's enterprise-grade MLOps framework, GameSense delivers a production-ready solution that can be deployed, monitored, and continuously improved with minimal engineering overhead. +Here's a concrete example from our training data: -## 💡 How It Works +### Input Example (Gaming Conversation) +``` +"Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac." +``` + +### Output Example (Structured Information) +``` +inform( + name[Dirt: Showdown], + release_year[2012], + esrb[E 10+ (for Everyone 10 and Older)], + genres[driving/racing, sport], + platforms[PlayStation, Xbox, PC], + available_on_steam[no], + has_linux_release[no], + has_mac_release[no] +) +``` + +This structured output can be used to: +- Answer specific questions about games ("Is Dirt: Showdown available on Mac?") +- Track trends in gaming discussions +- Power recommendation engines +- Extract user opinions and sentiment +- Build gaming knowledge graphs +- Enhance customer support + +## 🚀 How GameSense Transforms Gaming Conversations + +GameSense listens to gaming chats, forum posts, customer support tickets, social media, and other sources where gamers communicate. As gamers discuss different titles, features, opinions, and issues, GameSense: + +1. **Recognizes gaming jargon** across different genres and communities +2. **Extracts key information** about games, platforms, features, and opinions +3. **Structures this information** into a standardized format +4. 
**Makes it available** for downstream applications + +## 💡 Real-World Applications -GameSense leverages Parameter-Efficient Fine-Tuning (PEFT) techniques to customize powerful foundation models like Microsoft's Phi-2 or Llama 3.1 for gaming-specific applications. The system follows a streamlined pipeline: +### Community Analysis +Monitor conversations across Discord, Reddit, and other platforms to track what games are being discussed, what features players care about, and emerging trends. -1. **Data Preparation**: Gaming conversations are processed and tokenized -2. **Model Fine-Tuning**: The base model is efficiently customized using LoRA adapters -3. **Evaluation**: The model is rigorously tested against gaming-specific benchmarks -4. **Deployment**: High-performing models are automatically promoted to production +### Intelligent Customer Support +When a player says: "I can't get Dirt: Showdown to run on my Mac," GameSense identifies: +- The specific game (Dirt: Showdown) +- The platform issue (Mac) +- The fact that the game doesn't support Mac (from structured knowledge) +- Can immediately inform the player about platform incompatibility + +### Smart Recommendations +When a player has been discussing racing games for PlayStation with family-friendly ratings, GameSense can help power recommendations for similar titles they might enjoy. + +### Automated Content Moderation +By understanding the context of gaming conversations, GameSense can better identify toxic behavior while recognizing harmless gaming slang. + +## 🧠 Technical Approach + +GameSense uses Parameter-Efficient Fine-Tuning (PEFT) to customize powerful foundation models for understanding gaming language: + +1. We start with a base model like Microsoft's Phi-2 or Llama 3.1 +2. Fine-tune on the gem/viggo dataset containing structured gaming conversations +3. Use LoRA adapters for efficient training +4. Evaluate on gaming-specific benchmarks +5. Deploy to production environments
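+
+As a rough illustration of step 3, the sketch below shows how LoRA adapters are typically attached with the `peft` library. Treat it as a minimal sketch: the rank `r=8` mirrors the setting in `utils/loaders.py`, but the other hyperparameters and the choice of base model are assumptions; the values the pipeline actually uses come from the YAML files under `configs/`.
+
+```python
+# Illustrative sketch only: attach LoRA adapters to a frozen base model.
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import LoraConfig, get_peft_model
+
+base_model_id = "microsoft/Phi-3.5-mini-instruct"  # any supported base model
+
+model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(base_model_id)
+
+# Only the small adapter matrices are trained; the base weights stay frozen.
+lora_config = LoraConfig(
+    r=8,                # adapter rank, as in utils/loaders.py
+    lora_alpha=16,      # assumed scaling factor
+    lora_dropout=0.05,  # assumed dropout
+    bias="none",
+    task_type="CAUSAL_LM",
+)
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()  # typically well under 1% of all weights
+```
+
+The pipeline performs the equivalent setup inside its fine-tuning step, so you normally only adjust the config rather than writing this code yourself.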

@@ -46,6 +97,16 @@ GameSense leverages Parameter-Efficient Fine-Tuning (PEFT) techniques to customi - Python 3.8+ - GPU with at least 24GB VRAM (for full model training) - ZenML installed and configured +- Neptune.ai account for experiment tracking (optional) + +### Environment Setup + +1. Set up your Neptune.ai credentials if you want to use Neptune for experiment tracking: + ```bash + # Set your Neptune project name and API token as environment variables + export NEPTUNE_PROJECT="your-neptune-workspace/your-project-name" + export NEPTUNE_API_TOKEN="your-neptune-api-token" + ``` ### Quick Setup @@ -95,6 +156,17 @@ python run.py --config configs/llama3-1_finetune_local.yaml > - For remote finetuning: [`llama3-1_finetune_remote.yaml`](configs/llama3-1_finetune_remote.yaml) > - For local finetuning: [`llama3-1_finetune_local.yaml`](configs/llama3-1_finetune_local.yaml) +### Dataset Configuration + +By default, GameSense uses the gem/viggo dataset, which contains structured gaming information like: + +| gem_id | meaning_representation | target | references | +|--------|------------------------|--------|------------| +| viggo-train-0 | inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no]) | Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac. | [Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.] | +| viggo-train-1 | inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+...]) | Dirt: Showdown is a sport racing game... | [Dirt: Showdown is a sport racing game...] | + +You can also train on your own gaming conversations by formatting them in a similar structure and updating the configuration. + ### Training Acceleration For faster training on high-end hardware: @@ -148,7 +220,7 @@ For detailed instructions on data preparation, see our [data customization guide GameSense includes built-in evaluation using industry-standard metrics: -- **ROUGE Scores**: Measure response quality and relevance +- **ROUGE Scores**: Measure how well the model can generate natural language from structured data - **Gaming-Specific Benchmarks**: Evaluate understanding of gaming terminology - **Automatic Model Promotion**: Only deploy models that meet quality thresholds @@ -192,7 +264,7 @@ GameSense follows a modular architecture for easy customization: To fine-tune GameSense on your specific gaming platform's data: -1. **Format your dataset**: Prepare your gaming conversations in a structured format +1. **Format your dataset**: Prepare your gaming conversations in a structured format similar to gem/viggo 2. **Update the configuration**: Point to your dataset in the config file 3. **Run the pipeline**: GameSense will automatically process and learn from your data @@ -203,6 +275,55 @@ The [`prepare_data` step](steps/prepare_datasets.py) handles: For custom data sources, you'll need to prepare the splits in a Hugging Face dataset format. The step returns paths to the stored datasets (`train`, `val`, and `test_raw` splits), with the test set tokenized later during evaluation. 
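+
+As a rough sketch (the `target` and `meaning_representation` columns follow the gem/viggo schema; the rows and the output path are illustrative), a custom dataset could be written out like this:
+
+```python
+# Illustrative sketch: save custom gaming conversations as Hugging Face dataset splits.
+from pathlib import Path
+from datasets import Dataset
+
+rows = {
+    "target": [
+        "Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC ..."
+    ],
+    "meaning_representation": [
+        "inform(name[Dirt: Showdown], release_year[2012], platforms[PlayStation, Xbox, PC])"
+    ],
+}
+
+out_dir = Path("datasets")  # hypothetical output location
+for split in ("train", "val", "test_raw"):
+    # In practice each split holds different rows; the pipeline's prepare_data
+    # step also tokenizes the train and val splits before saving them.
+    Dataset.from_dict(rows).save_to_disk(str(out_dir / split))
+```
+
+The splits then just need to be reachable by the `prepare_data` step, for example by publishing them to the Hugging Face Hub and setting `dataset_name` in your config.
+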
+You can structure conversations from: +- Game forums +- Support tickets +- Discord chats +- Streaming chats +- Reviews +- Social media posts + ## 📚 Documentation For learning more about how to use ZenML to build your own MLOps pipelines, refer to our comprehensive [ZenML documentation](https://docs.zenml.io/). + +## Running on CPU-only Environment + +If you don't have access to a GPU, you can still run this project with the CPU-only configuration. We've made several optimizations to make this project work on CPU, including: + +- Smaller batch sizes for reduced memory footprint +- Fewer training steps +- Disabled GPU-specific features (quantization, bf16, etc.) +- Using smaller test datasets for evaluation +- Special handling for Phi-3.5 model caching issues on CPU + +To run the project on CPU: + +```bash +python run.py --config phi3.5_finetune_cpu.yaml +``` + +Note that training on CPU will be significantly slower than training on a GPU. The CPU configuration uses: + +1. A smaller model (`phi-3.5-mini-instruct`) which is more CPU-friendly +2. Reduced batch size and increased gradient accumulation steps +3. Fewer total training steps (50 instead of 300) +4. Half-precision (float16) where possible to reduce memory usage +5. Smaller dataset subsets (100 training samples, 20 validation samples, 10 test samples) +6. Special compatibility settings for Phi models running on CPU + +For best results, we recommend: +- Using a machine with at least 16GB of RAM +- Being patient! LLM training on CPU is much slower than on GPU +- If you still encounter memory issues, try reducing the `max_train_samples` parameter even further in the config file + +### Known Issues and Workarounds + +Some large language models like Phi-3.5 have caching mechanisms that are optimized for GPU usage and may encounter issues when running on CPU. Our CPU configuration includes several workarounds: + +1. Disabling KV caching for model generation +2. Using `torch.float16 data` type to reduce memory usage +3. Disabling flash attention which isn't needed on CPU +4. Using standard AdamW optimizer instead of 8-bit optimizers that require GPU + +These changes allow the model to run on CPU with less memory and avoid compatibility issues, although at the cost of some performance. diff --git a/gamesense/configs/phi3.5_finetune_cpu.yaml b/gamesense/configs/phi3.5_finetune_cpu.yaml new file mode 100644 index 000000000..0c243ec81 --- /dev/null +++ b/gamesense/configs/phi3.5_finetune_cpu.yaml @@ -0,0 +1,85 @@ +# Apache Software License 2.0 +# +# Copyright (c) ZenML GmbH 2024. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +model: + name: llm-peft-phi-3.5-mini-instruct-cpu + description: "Fine-tune Phi-3.5-mini-instruct on CPU." 
+ tags: + - llm + - peft + - phi-3.5 + - cpu + version: 100_steps + +settings: + docker: + parent_image: pytorch/pytorch:2.2.2-runtime + requirements: requirements.txt + python_package_installer: uv + python_package_installer_args: + system: null + apt_packages: + - git + environment: + MKL_SERVICE_FORCE_INTEL: "1" + # Explicitly disable MPS + PYTORCH_ENABLE_MPS_FALLBACK: "0" + PYTORCH_MPS_HIGH_WATERMARK_RATIO: "0.0" + +parameters: + # Uses a smaller model for CPU training + base_model_id: microsoft/Phi-3.5-mini-instruct + use_fast: False + load_in_4bit: False + load_in_8bit: False + cpu_only: True # Enable CPU-only mode + # Extra conservative dataset size for CPU + max_train_samples: 50 + max_val_samples: 10 + max_test_samples: 5 + system_prompt: | + Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. + This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. + The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] + + +steps: + prepare_data: + parameters: + dataset_name: gem/viggo + # These settings are now defined at the pipeline level + # max_train_samples: 100 + # max_val_samples: 20 + # max_test_samples: 10 + + finetune: + parameters: + max_steps: 25 # Further reduced steps for CPU training + eval_steps: 5 # More frequent evaluation + bf16: False # Disable bf16 for CPU compatibility + per_device_train_batch_size: 1 # Smallest batch size for CPU + gradient_accumulation_steps: 2 # Reduced for CPU + optimizer: "adamw_torch" # Use standard AdamW rather than 8-bit for CPU + logging_steps: 2 # More frequent logging + save_steps: 25 # Save less frequently + save_total_limit: 1 # Keep only the best model + evaluation_strategy: "steps" + + promote: + parameters: + metric: rouge2 + target_stage: staging \ No newline at end of file diff --git a/gamesense/pipelines/train.py b/gamesense/pipelines/train.py index c91a76381..71a042bf1 100644 --- a/gamesense/pipelines/train.py +++ b/gamesense/pipelines/train.py @@ -33,6 +33,10 @@ def llm_peft_full_finetune( use_fast: bool = True, load_in_8bit: bool = False, load_in_4bit: bool = False, + cpu_only: bool = False, + max_train_samples: int = None, + max_val_samples: int = None, + max_test_samples: int = None, ): """Pipeline for finetuning an LLM with peft. @@ -42,20 +46,39 @@ def llm_peft_full_finetune( - finetune: finetune the model - evaluate_model: evaluate the base and finetuned model - promote: promote the model to the target stage, if evaluation was successful + + Args: + system_prompt: The system prompt to use. + base_model_id: The base model id to use. + use_fast: Whether to use the fast tokenizer. + load_in_8bit: Whether to load in 8-bit precision (requires GPU). + load_in_4bit: Whether to load in 4-bit precision (requires GPU). + cpu_only: Whether to force using CPU only and disable quantization. + max_train_samples: Maximum number of training samples to use (for CPU or testing). + max_val_samples: Maximum number of validation samples to use (for CPU or testing). + max_test_samples: Maximum number of test samples to use (for CPU or testing). 
""" - if not load_in_8bit and not load_in_4bit: - raise ValueError( - "At least one of `load_in_8bit` and `load_in_4bit` must be True." - ) - if load_in_4bit and load_in_8bit: - raise ValueError( - "Only one of `load_in_8bit` and `load_in_4bit` can be True." - ) + if not cpu_only: + if not load_in_8bit and not load_in_4bit: + raise ValueError( + "At least one of `load_in_8bit` and `load_in_4bit` must be True when not in CPU-only mode." + ) + if load_in_4bit and load_in_8bit: + raise ValueError( + "Only one of `load_in_8bit` and `load_in_4bit` can be True." + ) + + if cpu_only: + load_in_8bit = False + load_in_4bit = False datasets_dir = prepare_data( base_model_id=base_model_id, system_prompt=system_prompt, use_fast=use_fast, + max_train_samples=max_train_samples, + max_val_samples=max_val_samples, + max_test_samples=max_test_samples, ) evaluate_model( @@ -66,6 +89,7 @@ def llm_peft_full_finetune( use_fast=use_fast, load_in_8bit=load_in_8bit, load_in_4bit=load_in_4bit, + cpu_only=cpu_only, id="evaluate_base", ) log_metadata_from_step_artifact( @@ -82,6 +106,8 @@ def llm_peft_full_finetune( load_in_8bit=load_in_8bit, load_in_4bit=load_in_4bit, use_accelerate=False, + cpu_only=cpu_only, + bf16=not cpu_only, ) evaluate_model( @@ -92,6 +118,7 @@ def llm_peft_full_finetune( use_fast=use_fast, load_in_8bit=load_in_8bit, load_in_4bit=load_in_4bit, + cpu_only=cpu_only, id="evaluate_finetuned", ) log_metadata_from_step_artifact( diff --git a/gamesense/pipelines/train_accelerated.py b/gamesense/pipelines/train_accelerated.py index de05601ea..7c74e3b54 100644 --- a/gamesense/pipelines/train_accelerated.py +++ b/gamesense/pipelines/train_accelerated.py @@ -34,6 +34,9 @@ def llm_peft_full_finetune( use_fast: bool = True, load_in_8bit: bool = False, load_in_4bit: bool = False, + max_train_samples: int = None, + max_val_samples: int = None, + max_test_samples: int = None, ): """Pipeline for finetuning an LLM with peft. @@ -43,6 +46,16 @@ def llm_peft_full_finetune( - finetune: finetune the model - evaluate_model: evaluate the base and finetuned model - promote: promote the model to the target stage, if evaluation was successful + + Args: + system_prompt: The system prompt to use. + base_model_id: The base model id to use. + use_fast: Whether to use the fast tokenizer. + load_in_8bit: Whether to load in 8-bit precision (requires GPU). + load_in_4bit: Whether to load in 4-bit precision (requires GPU). + max_train_samples: Maximum number of training samples to use (for CPU or testing). + max_val_samples: Maximum number of validation samples to use (for CPU or testing). + max_test_samples: Maximum number of test samples to use (for CPU or testing). 
""" if not load_in_8bit and not load_in_4bit: raise ValueError( @@ -57,6 +70,9 @@ def llm_peft_full_finetune( base_model_id=base_model_id, system_prompt=system_prompt, use_fast=use_fast, + max_train_samples=max_train_samples, + max_val_samples=max_val_samples, + max_test_samples=max_test_samples, ) evaluate_model( diff --git a/gamesense/run.py b/gamesense/run.py index 8b56d7073..3d97a0f25 100644 --- a/gamesense/run.py +++ b/gamesense/run.py @@ -76,7 +76,19 @@ def main( if not config: raise RuntimeError("Config file is required to run a pipeline.") - pipeline_args["config_path"] = os.path.join(config_folder, config) + config_path = os.path.join(config_folder, config) + pipeline_args["config_path"] = config_path + + # Display a message if using CPU configuration + if "cpu" in config: + print("\n" + "="*80) + print("RUNNING IN CPU-ONLY MODE") + print("This will use a CPU-optimized configuration with:") + print("- Smaller batch sizes") + print("- Fewer training steps") + print("- Disabled GPU-specific features (quantization, bf16, etc)") + print("Note: Training will be much slower but should require less memory") + print("="*80 + "\n") if accelerate: from pipelines.train_accelerated import llm_peft_full_finetune diff --git a/gamesense/steps/evaluate_model.py b/gamesense/steps/evaluate_model.py index 1c7c82067..39ee1f0cf 100644 --- a/gamesense/steps/evaluate_model.py +++ b/gamesense/steps/evaluate_model.py @@ -45,6 +45,7 @@ def evaluate_model( use_fast: bool = True, load_in_4bit: bool = False, load_in_8bit: bool = False, + cpu_only: bool = False, ) -> None: """Evaluate the model with ROUGE metrics. @@ -57,7 +58,13 @@ def evaluate_model( use_fast: Whether to use the fast tokenizer. load_in_4bit: Whether to load the model in 4bit mode. load_in_8bit: Whether to load the model in 8bit mode. + cpu_only: Whether to force using CPU only and disable quantization. 
""" + # Force disable GPU optimizations if in CPU-only mode + if cpu_only: + load_in_4bit = False + load_in_8bit = False + cleanup_gpu_memory(force=True) # authenticate with Hugging Face for gated repos @@ -79,7 +86,14 @@ def evaluate_model( use_fast=use_fast, ) test_dataset = load_from_disk(str((datasets_dir / "test_raw").absolute())) - test_dataset = test_dataset[:50] + + # Reduce dataset size for CPU evaluation to make it more manageable + if cpu_only: + logger.info("CPU-only mode: Using a smaller test dataset subset") + test_dataset = test_dataset[:10] # Use only 10 samples for CPU + else: + test_dataset = test_dataset[:50] # Use 50 samples for GPU + ground_truths = test_dataset["meaning_representation"] tokenized_train_dataset = tokenize_for_eval( test_dataset, tokenizer, system_prompt @@ -92,6 +106,7 @@ def evaluate_model( is_training=False, load_in_4bit=load_in_4bit, load_in_8bit=load_in_8bit, + cpu_only=cpu_only, ) else: logger.info("Generating using finetuned model...") @@ -99,16 +114,106 @@ def evaluate_model( ft_model_dir, load_in_4bit=load_in_4bit, load_in_8bit=load_in_8bit, + cpu_only=cpu_only, ) model.eval() + + # Adjust generation parameters for CPU + max_new_tokens = 30 if cpu_only else 100 + + # Preemptively disable use_cache for Phi models on CPU to avoid 'get_max_length' error + is_phi_model = "phi" in base_model_id.lower() + use_cache = not (is_phi_model and cpu_only) + + if not use_cache: + logger.info("Preemptively disabling KV cache for Phi model on CPU") + if hasattr(model.config, "use_cache"): + model.config.use_cache = False + with torch.no_grad(): - predictions = model.generate( - input_ids=tokenized_train_dataset["input_ids"], - attention_mask=tokenized_train_dataset["attention_mask"], - max_new_tokens=100, - pad_token_id=2, - ) + try: + # Move inputs to the same device as the model + device = next(model.parameters()).device + input_ids = tokenized_train_dataset["input_ids"].to(device) + attention_mask = tokenized_train_dataset["attention_mask"].to(device) + + # Generate with appropriate parameters + logger.info(f"Generating with use_cache={use_cache}") + predictions = model.generate( + input_ids=input_ids, + attention_mask=attention_mask, + max_new_tokens=max_new_tokens, + pad_token_id=2, + use_cache=use_cache, # Use the preemptively determined setting + do_sample=False # Use greedy decoding for more stable results on CPU + ) + except (AttributeError, RuntimeError) as e: + logger.warning(f"Initial generation attempt failed with error: {str(e)}") + + # First fallback: try with more safety settings + if "get_max_length" in str(e) or "DynamicCache" in str(e) or cpu_only: + logger.warning("Using fallback generation strategy with minimal parameters") + try: + # Force model to CPU if needed + if not str(next(model.parameters()).device) == "cpu": + logger.info("Moving model to CPU for generation") + model = model.to("cpu") + + # Move inputs to CPU + input_ids = tokenized_train_dataset["input_ids"].to("cpu") + attention_mask = tokenized_train_dataset["attention_mask"].to("cpu") + + predictions = model.generate( + input_ids=input_ids, + attention_mask=attention_mask, + max_new_tokens=20, # Even smaller for safety + pad_token_id=2, + use_cache=False, # Disable KV caching completely + do_sample=False, # Use greedy decoding + num_beams=1 # Simple beam search + ) + except (RuntimeError, Exception) as e2: + logger.warning(f"Second generation attempt failed with error: {str(e2)}") + + # Final fallback: process one sample at a time + logger.warning("Final fallback: processing 
one sample at a time") + + # Process one sample at a time + all_predictions = [] + batch_size = tokenized_train_dataset["input_ids"].shape[0] + + for i in range(batch_size): + try: + # Process one sample at a time + single_input = tokenized_train_dataset["input_ids"][i:i+1].to("cpu") + single_attention = tokenized_train_dataset["attention_mask"][i:i+1].to("cpu") + + single_pred = model.generate( + input_ids=single_input, + attention_mask=single_attention, + max_new_tokens=20, # Even further reduced for safety + num_beams=1, + do_sample=False, + use_cache=False, + pad_token_id=2, + ) + all_predictions.append(single_pred) + except Exception as sample_error: + logger.error(f"Failed to generate for sample {i}: {str(sample_error)}") + # Create an empty prediction as placeholder + all_predictions.append(tokenized_train_dataset["input_ids"][i:i+1]) + + # Combine the individual predictions + if all_predictions: + predictions = torch.cat(all_predictions, dim=0) + else: + # If all samples failed, return original inputs + logger.error("All samples failed in generation. Using inputs as fallback.") + predictions = tokenized_train_dataset["input_ids"] + else: + # Re-raise if not a cache-related issue + raise e predictions = tokenizer.batch_decode( predictions[:, tokenized_train_dataset["input_ids"].shape[1] :], skip_special_tokens=True, diff --git a/gamesense/steps/finetune.py b/gamesense/steps/finetune.py index 5421757d7..cea0804ee 100644 --- a/gamesense/steps/finetune.py +++ b/gamesense/steps/finetune.py @@ -50,11 +50,14 @@ def finetune( per_device_train_batch_size: int = 2, gradient_accumulation_steps: int = 4, warmup_steps: int = 5, - bf16: bool = True, + bf16: bool = False, # Changed to default False for CPU compatibility use_accelerate: bool = False, use_fast: bool = True, load_in_4bit: bool = False, load_in_8bit: bool = False, + cpu_only: bool = False, + save_total_limit: int = 1, + evaluation_strategy: str = "steps", ) -> Annotated[ Path, ArtifactConfig(name="ft_model_dir", artifact_type=ArtifactType.MODEL) ]: @@ -82,10 +85,19 @@ def finetune( use_fast: Whether to use the fast tokenizer. load_in_4bit: Whether to load the model in 4bit mode. load_in_8bit: Whether to load the model in 8bit mode. + cpu_only: Whether to force using CPU only and disable quantization. + save_total_limit: The total number of checkpoints to keep (None means keep all). + evaluation_strategy: The evaluation strategy to use (steps, epoch, or no). Returns: The path to the finetuned model directory. 
""" + # Force disable GPU optimizations if in CPU-only mode + if cpu_only: + load_in_4bit = False + load_in_8bit = False + bf16 = False + cleanup_gpu_memory(force=True) # authenticate with Hugging Face for gated repos @@ -131,6 +143,7 @@ def finetune( should_print=should_print, load_in_4bit=load_in_4bit, load_in_8bit=load_in_8bit, + cpu_only=cpu_only, # Pass the CPU-only flag to the model loader ) trainer = transformers.Trainer( @@ -160,11 +173,12 @@ def finetune( save_steps=min(save_steps, max_steps) if max_steps >= 0 else save_steps, - evaluation_strategy="steps", + evaluation_strategy=evaluation_strategy, eval_steps=eval_steps, do_eval=True, label_names=["input_ids"], ddp_find_unused_parameters=False, + save_total_limit=save_total_limit, ), data_collator=transformers.DataCollatorForLanguageModeling( tokenizer, mlm=False diff --git a/gamesense/steps/log_metadata.py b/gamesense/steps/log_metadata.py index 14371b78b..d0dc4729f 100644 --- a/gamesense/steps/log_metadata.py +++ b/gamesense/steps/log_metadata.py @@ -17,7 +17,7 @@ from typing import Any, Dict -from zenml import get_step_context, log_model_metadata, step +from zenml import get_step_context, log_metadata, step @step(enable_cache=False) @@ -34,9 +34,11 @@ def log_metadata_from_step_artifact( context = get_step_context() metadata_dict: Dict[str, Any] = ( - context.pipeline_run.steps[step_name].outputs[artifact_name].load() + context.pipeline_run.steps[step_name].outputs[artifact_name] ) - metadata = {artifact_name: metadata_dict} - - log_model_metadata(metadata) + log_metadata( + artifact_name=artifact_name, + metadata={"model_name": "phi3.5_finetune_cpu"}, + infer_model=True, + ) diff --git a/gamesense/steps/prepare_datasets.py b/gamesense/steps/prepare_datasets.py index 3e58b00e1..00711191e 100644 --- a/gamesense/steps/prepare_datasets.py +++ b/gamesense/steps/prepare_datasets.py @@ -32,6 +32,9 @@ def prepare_data( system_prompt: str, dataset_name: str = "gem/viggo", use_fast: bool = True, + max_train_samples: int = None, + max_val_samples: int = None, + max_test_samples: int = None, ) -> Annotated[Path, "datasets_dir"]: """Prepare the datasets for finetuning. @@ -40,18 +43,31 @@ def prepare_data( system_prompt: The system prompt to use. dataset_name: The name of the dataset to use. use_fast: Whether to use the fast tokenizer. + max_train_samples: Maximum number of training samples to use (for CPU or testing). + max_val_samples: Maximum number of validation samples to use (for CPU or testing). + max_test_samples: Maximum number of test samples to use (for CPU or testing). Returns: The path to the datasets directory. 
""" from datasets import load_dataset + import logging + logger = logging.getLogger(__name__) cleanup_gpu_memory(force=True) + # Set default values if None (to prevent validation errors) + max_train_samples = max_train_samples if max_train_samples is not None else 0 + max_val_samples = max_val_samples if max_val_samples is not None else 0 + max_test_samples = max_test_samples if max_test_samples is not None else 0 + log_model_metadata( { "system_prompt": system_prompt, "base_model_id": base_model_id, + "max_train_samples": max_train_samples, + "max_val_samples": max_val_samples, + "max_test_samples": max_test_samples, } ) @@ -62,23 +78,39 @@ def prepare_data( system_prompt=system_prompt, ) + # Load and potentially limit the training dataset train_dataset = load_dataset( dataset_name, split="train", trust_remote_code=True, ) + if max_train_samples > 0 and max_train_samples < len(train_dataset): + logger.info(f"Limiting training dataset to {max_train_samples} samples (from {len(train_dataset)})") + train_dataset = train_dataset.select(range(max_train_samples)) + tokenized_train_dataset = train_dataset.map(gen_and_tokenize) + + # Load and potentially limit the validation dataset eval_dataset = load_dataset( dataset_name, split="validation", trust_remote_code=True, ) + if max_val_samples > 0 and max_val_samples < len(eval_dataset): + logger.info(f"Limiting validation dataset to {max_val_samples} samples (from {len(eval_dataset)})") + eval_dataset = eval_dataset.select(range(max_val_samples)) + tokenized_val_dataset = eval_dataset.map(gen_and_tokenize) + + # Load and potentially limit the test dataset test_dataset = load_dataset( dataset_name, split="test", trust_remote_code=True, ) + if max_test_samples > 0 and max_test_samples < len(test_dataset): + logger.info(f"Limiting test dataset to {max_test_samples} samples (from {len(test_dataset)})") + test_dataset = test_dataset.select(range(max_test_samples)) datasets_path = Path("datasets") tokenized_train_dataset.save_to_disk( diff --git a/gamesense/utils/loaders.py b/gamesense/utils/loaders.py index 5ddeeae56..919c269bc 100644 --- a/gamesense/utils/loaders.py +++ b/gamesense/utils/loaders.py @@ -33,6 +33,7 @@ def load_base_model( should_print: bool = True, load_in_8bit: bool = False, load_in_4bit: bool = False, + cpu_only: bool = False, ) -> Union[Any, Tuple[Any, Dataset, Dataset]]: """Load the base model. @@ -45,37 +46,102 @@ def load_base_model( should_print: Whether to print the trainable parameters. load_in_8bit: Whether to load the model in 8-bit mode. load_in_4bit: Whether to load the model in 4-bit mode. + cpu_only: Whether to force using CPU only and disable quantization. Returns: The base model. 
""" from accelerate import Accelerator from transformers import BitsAndBytesConfig + import logging + logger = logging.getLogger(__name__) + + # Explicitly disable MPS when in CPU-only mode + if cpu_only: + import os + os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0" + os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "0" + # Force PyTorch to not use MPS + torch._C._set_mps_enabled(False) if hasattr(torch._C, "_set_mps_enabled") else None + # Set default device to CPU explicitly + torch.set_default_device("cpu") + logger.warning("Disabled MPS device for CPU-only mode.") if use_accelerate: accelerator = Accelerator() device_map = {"": accelerator.process_index} else: - device_map = {"": torch.cuda.current_device()} - - bnb_config = BitsAndBytesConfig( - load_in_8bit=load_in_8bit, - load_in_4bit=load_in_4bit, - bnb_4bit_use_double_quant=True, - bnb_4bit_quant_type="nf4", - bnb_4bit_compute_dtype=torch.bfloat16, - ) + # Check for available devices and use the best one + if cpu_only: + device_map = {"": "cpu"} + elif torch.cuda.is_available(): + device_map = {"": torch.cuda.current_device()} + elif torch.backends.mps.is_available() and not cpu_only: + device_map = {"": "mps"} + else: + device_map = {"": "cpu"} + + # Only use BitsAndBytes config if CUDA is available and quantization is requested + # and we're not in CPU-only mode + if (load_in_8bit or load_in_4bit) and torch.cuda.is_available() and not cpu_only: + bnb_config = BitsAndBytesConfig( + load_in_8bit=load_in_8bit, + load_in_4bit=load_in_4bit, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + ) + else: + bnb_config = None + # Reset these flags if CUDA is not available or in CPU-only mode + load_in_8bit = False + load_in_4bit = False + + # Print device information for debugging + if should_print: + print(f"Loading model on device: {device_map}") + + # Use half precision for CPU to reduce memory usage if not in training + torch_dtype = torch.float16 if device_map[""] == "cpu" and not is_training else None + + # Check if it's a Phi model + is_phi_model = "phi" in base_model_id.lower() + + model_kwargs = { + "quantization_config": bnb_config, + "device_map": device_map, + "trust_remote_code": True, + "torch_dtype": torch_dtype, + # Use low_cpu_mem_usage for CPU training to minimize memory usage + "low_cpu_mem_usage": device_map[""] == "cpu", + } + + # Add special config for Phi models on CPU to avoid cache issues + if is_phi_model and (cpu_only or device_map[""] == "cpu"): + if should_print: + print("Loading Phi model on CPU with special configuration to avoid caching issues") + model_kwargs["use_flash_attention_2"] = False + # Set attn_implementation to eager for Phi models on CPU + model_kwargs["attn_implementation"] = "eager" model = AutoModelForCausalLM.from_pretrained( base_model_id, - quantization_config=bnb_config, - device_map=device_map, - trust_remote_code=True, + **model_kwargs ) + # For Phi models on CPU, disable kv cache feature to avoid errors + if is_phi_model and (cpu_only or device_map[""] == "cpu"): + if hasattr(model.config, "use_cache"): + model.config.use_cache = False + if should_print: + print("Disabled KV cache for Phi model on CPU to avoid errors") + if is_training: model.gradient_checkpointing_enable() - model = prepare_model_for_kbit_training(model) + + # For CPU-only mode, skip prepare_model_for_kbit_training if not using quantization + if not (cpu_only and not (load_in_8bit or load_in_4bit)): + model = prepare_model_for_kbit_training(model) config = 
LoraConfig( r=8, @@ -108,6 +174,7 @@ def load_pretrained_model( ft_model_dir: Path, load_in_4bit: bool = False, load_in_8bit: bool = False, + cpu_only: bool = False, ) -> AutoModelForCausalLM: """Load the finetuned model saved in the output directory. @@ -115,23 +182,76 @@ def load_pretrained_model( ft_model_dir: The path to the finetuned model directory. load_in_4bit: Whether to load the model in 4-bit mode. load_in_8bit: Whether to load the model in 8-bit mode. + cpu_only: Whether to force using CPU only and disable quantization. Returns: The finetuned model. """ from transformers import BitsAndBytesConfig + import logging + logger = logging.getLogger(__name__) - bnb_config = BitsAndBytesConfig( - load_in_8bit=load_in_8bit, - load_in_4bit=load_in_4bit, - bnb_4bit_use_double_quant=True, - bnb_4bit_quant_type="nf4", - bnb_4bit_compute_dtype=torch.bfloat16, - ) + # Explicitly disable MPS when in CPU-only mode + if cpu_only: + import os + os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0" + os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "0" + # Force PyTorch to not use MPS + torch._C._set_mps_enabled(False) if hasattr(torch._C, "_set_mps_enabled") else None + # Set default device to CPU explicitly + torch.set_default_device("cpu") + logger.warning("Disabled MPS device for CPU-only mode.") + + # Set device map based on available hardware and settings + if cpu_only: + device_map = "cpu" + else: + device_map = "auto" + + # Only use BitsAndBytes config if quantization is requested and we're not in CPU-only mode + if (load_in_8bit or load_in_4bit) and not cpu_only and torch.cuda.is_available(): + bnb_config = BitsAndBytesConfig( + load_in_8bit=load_in_8bit, + load_in_4bit=load_in_4bit, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + ) + else: + bnb_config = None + + # Use half precision for CPU to reduce memory usage + torch_dtype = torch.float16 if device_map == "cpu" else None + + # Special config for Phi models on CPU to avoid cache issues + # Check if it's a Phi model + is_phi_model = "phi" in str(ft_model_dir).lower() + + model_kwargs = { + "quantization_config": bnb_config, + "device_map": device_map, + "trust_remote_code": True, + "torch_dtype": torch_dtype, + # Use low_cpu_mem_usage for CPU to minimize memory usage + "low_cpu_mem_usage": device_map == "cpu", + } + + # Add special config for Phi models on CPU to avoid cache issues + if is_phi_model and (cpu_only or device_map == "cpu"): + logger.warning("Loading Phi model on CPU with special configuration to avoid caching issues") + model_kwargs["use_flash_attention_2"] = False + # Set attn_implementation to eager for Phi models on CPU + model_kwargs["attn_implementation"] = "eager" + model = AutoModelForCausalLM.from_pretrained( ft_model_dir, - quantization_config=bnb_config, - device_map="auto", - trust_remote_code=True, + **model_kwargs ) + + # For Phi models on CPU, disable kv cache feature to avoid errors + if is_phi_model and (cpu_only or device_map == "cpu"): + if hasattr(model.config, "use_cache"): + model.config.use_cache = False + logger.warning("Disabled KV cache for Phi model on CPU to avoid errors") + return model diff --git a/gamesense/utils/tokenizer.py b/gamesense/utils/tokenizer.py index 6e92dfe34..66a55d785 100644 --- a/gamesense/utils/tokenizer.py +++ b/gamesense/utils/tokenizer.py @@ -17,6 +17,7 @@ from transformers import AutoTokenizer +import torch def load_tokenizer( @@ -113,9 +114,7 @@ def tokenize_for_eval( tokenizer: AutoTokenizer, system_prompt: str, ): - 
"""Tokenizes the prompts for evaluation. - - This runs for the whole test dataset at once. + """Tokenize the data for evaluation. Args: data_points: The data points to tokenize. @@ -123,11 +122,10 @@ def tokenize_for_eval( system_prompt: The system prompt to use. Returns: - The tokenized prompt. + The tokenized data. """ eval_prompts = [ - f"""{system_prompt} - + f""" ### Target sentence: {data_point} @@ -135,6 +133,8 @@ def tokenize_for_eval( """ for data_point in data_points["target"] ] + # Use the available device instead of hardcoding "cuda" + device = "mps" if torch.backends.mps.is_available() else "cpu" return tokenizer(eval_prompts, padding="longest", return_tensors="pt").to( - "cuda" + device )