# Machine Learning Project Setup Guide

This notebook demonstrates how to set up a complete machine learning project structure with best practices for organization, dependency management, and version control.

## What You'll Learn:
- Creating a standardized ML project directory structure
- Setting up Python virtual environments
- Installing essential ML libraries
- Configuring version control with Git
- Setting up Jupyter notebook environments
- Organizing data directories properly
- Managing environment variables and secrets

## Prerequisites:
- Python 3.8+ installed on your system
- Git installed (optional but recommended)
- Basic command line knowledge

Let's get started! 🚀

## 1. Create Project Directory Structure

A well-organized project structure is crucial for maintainable ML projects. Let's create a standardized folder layout that follows industry best practices.

In [None]:
from pathlib import Path

# Define the project structure
project_structure = {
    'data': ['raw', 'processed', 'external'],
    'notebooks': [],
    'src': [],
    'models': [],
    'reports': ['figures'],
    'experiments': [],
    'tests': [],
    'configs': []
}

# Resolve the project root. This notebook is assumed to live in
# <project_root>/notebooks/, so the root is the parent of the working directory.
project_root = Path.cwd().parent
print(f"Project root: {project_root}")

# Create directory structure
for main_dir, subdirs in project_structure.items():
    main_path = project_root / main_dir
    main_path.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created directory: {main_path}")

    for subdir in subdirs:
        sub_path = main_path / subdir
        sub_path.mkdir(exist_ok=True)
        print(f"  ✓ Created subdirectory: {sub_path}")

print("\n🎉 Project directory structure created successfully!")

## 2. Set Up Virtual Environment

Virtual environments isolate your project dependencies and prevent conflicts between different projects. The cell below checks whether this notebook is running inside a virtual environment and, if it is not, prints the commands to create and activate one.

In [None]:
import sys
import subprocess
import venv

# Check if we're in a virtual environment
def is_venv():
    return hasattr(sys, 'real_prefix') or (
        hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix
    )

print(f"Python executable: {sys.executable}")
print(f"Currently in virtual environment: {is_venv()}")

# If not in virtual environment, show instructions
if not is_venv():
    print("\n⚠️  You're not in a virtual environment!")
    print("\nTo create and activate a virtual environment:")
    print("\n# Windows:")
    print("python -m venv ml_env")
    print("ml_env\\Scripts\\activate")
    print("\n# macOS/Linux:")
    print("python3 -m venv ml_env")
    print("source ml_env/bin/activate")
    print("\nThen restart this notebook!")
else:
    print("✅ Great! You're already in a virtual environment.")
    print(f"Environment name: {Path(sys.executable).parent.parent.name}")
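
The environment can also be created from Python with the standard-library `venv` module instead of the command line. The sketch below is a minimal illustration: the helper name `create_venv` is hypothetical, and the demo builds the environment in a throwaway temporary directory with `with_pip=False` only to keep it fast. For a real project you would point it at `ml_env` in the project root and leave pip enabled.

```python
import tempfile
import venv
from pathlib import Path


def create_venv(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return its path."""
    env_dir = Path(env_dir)
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    env_path = create_venv(Path(tmp) / "ml_env", with_pip=False)
    # Every venv contains a pyvenv.cfg file at its top level.
    cfg_exists = (env_path / "pyvenv.cfg").exists()
    print(f"pyvenv.cfg present: {cfg_exists}")
```

Creating the environment this way does not activate it. You still need to run the activation command in your shell and start Jupyter from there.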

## 3. Install Essential ML Libraries

Let's install the core machine learning libraries that you'll need for most projects. We'll use pip to install packages and then verify the installation.

In [None]:
# Essential libraries for machine learning
essential_packages = [
    'numpy>=1.24.0',
    'pandas>=2.0.0',
    'matplotlib>=3.7.0',
    'seaborn>=0.12.0',
    'scikit-learn>=1.3.0',
    'jupyter>=1.0.0',
    'ipykernel>=6.25.0'
]

# Additional useful packages
additional_packages = [
    'plotly>=5.17.0',
    'xgboost>=2.0.0',
    'lightgbm>=4.0.0',
    'shap>=0.42.0',
    'tqdm>=4.65.0'
]

def install_packages(packages, description):
    print(f"\n📦 Installing {description}...")
    for package in packages:
        try:
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
            print(f"✅ {package}")
        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to install {package}: {e}")

# Installation is opt-in because it can take a few minutes.

print("🔧 Ready to install packages...")
print("Uncomment and run the installation commands below:")
print("# install_packages(essential_packages, 'essential ML packages')")
print("# install_packages(additional_packages, 'additional useful packages')")

# Uncomment these lines when you're ready to install:
# install_packages(essential_packages, 'essential ML packages')
# install_packages(additional_packages, 'additional useful packages')
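
To verify the installation afterwards, one option is to ask the standard-library `importlib.metadata` module which version of each distribution is installed. This is a sketch rather than part of the original notebook: the helper name `installed_versions` and the deliberately fake package name are illustrative assumptions.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The last name is intentionally bogus to show the "not installed" path.
report = installed_versions(
    ["numpy", "pandas", "scikit-learn", "definitely-not-installed-xyz"]
)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Note that the lookup uses distribution names as published on PyPI (`scikit-learn`), not import names (`sklearn`).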

## 4. Create Requirements File

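A requirements.txt file records the packages your project depends on, which makes it easy to recreate the environment later. The specifiers below are minimum versions (`>=`), so they do not pin an exact environment. For an exact record, the cell also writes a `pip freeze` snapshot of the current environment to `requirements_current.txt`.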

In [None]:
# Create a comprehensive requirements.txt file
requirements_content = """# Core Data Science Libraries
numpy>=1.24.0
pandas>=2.0.0
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.17.0

# Machine Learning Libraries
scikit-learn>=1.3.0
xgboost>=2.0.0
lightgbm>=4.0.0
catboost>=1.2.0

# Deep Learning (optional)
# tensorflow>=2.13.0
# torch>=2.0.0
# torchvision>=0.15.0

# Jupyter and Development
jupyter>=1.0.0
jupyterlab>=4.0.0
ipykernel>=6.25.0
ipywidgets>=8.0.0

# Data Processing
scipy>=1.11.0
statsmodels>=0.14.0
openpyxl>=3.1.0

# Model Interpretation
shap>=0.42.0
lime>=0.2.0

# Utilities
tqdm>=4.65.0
joblib>=1.3.0
requests>=2.31.0
python-dotenv>=1.0.0

# Development Tools
black>=23.0.0
flake8>=6.0.0
pytest>=7.4.0

# Optional but useful
# optuna>=3.3.0  # Hyperparameter optimization
# mlflow>=2.5.0  # ML experiment tracking
"""

# Write requirements.txt file
requirements_path = project_root / "requirements.txt"
with open(requirements_path, 'w', encoding='utf-8') as f:
    f.write(requirements_content)

print(f"✅ Created requirements.txt at: {requirements_path}")
print("\nTo install all requirements later, run:")
print("pip install -r requirements.txt")

# Also snapshot the exact versions installed in the current environment.
# check=True makes a failing `pip freeze` raise instead of writing an empty file.
try:
    result = subprocess.run([sys.executable, '-m', 'pip', 'freeze'],
                            capture_output=True, text=True, check=True)
    current_reqs_path = project_root / "requirements_current.txt"
    with open(current_reqs_path, 'w', encoding='utf-8') as f:
        f.write(result.stdout)
    print(f"✅ Created current environment snapshot: {current_reqs_path}")
except Exception as e:
    print(f"⚠️  Could not create current requirements: {e}")
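
Because the requirements file mixes active lines with commented-out optional packages, it can help to list only what pip would actually install. The sketch below is a simplified parser written for illustration (the function name `active_requirements` is hypothetical). It assumes plain `name<specifier>` lines like the ones above and does not handle URLs, extras, or environment markers.

```python
import re


def active_requirements(text):
    """Return (name, specifier) pairs for non-comment, non-blank lines."""
    pairs = []
    for line in text.splitlines():
        # Drop full-line and inline comments, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        match = re.match(r"^([A-Za-z0-9._-]+)\s*(.*)$", line)
        if match:
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs


sample = """# Core
numpy>=1.24.0
# torch>=2.0.0
tqdm>=4.65.0  # progress bars
"""
print(active_requirements(sample))
# → [('numpy', '>=1.24.0'), ('tqdm', '>=4.65.0')]
```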

## 5. Initialize Git Repository

Version control is essential for tracking changes in your ML projects. Let's set up Git with a proper .gitignore file for Python/ML projects.

In [None]:
# Check if git is available and initialize repository
def check_git():
    try:
        subprocess.run(['git', '--version'], capture_output=True, check=True)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False

# Create .gitignore file for ML projects
gitignore_content = """# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
ml_env/
venv/
env/
ENV/

# Jupyter Notebook
.ipynb_checkpoints/

# Data files (add specific patterns as needed)
data/raw/*
data/external/*
!data/raw/.gitkeep
!data/external/.gitkeep
!data/raw/README.md
!data/external/README.md

# Model files
*.joblib
*.pkl
*.h5
*.pb
*.pth
models/*.pkl
models/*.joblib

# Large files
*.zip
*.tar.gz
*.rar

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
Thumbs.db

# MLflow
mlruns/
mlartifacts/

# Environment variables
.env
.env.local

# Temporary files
*.tmp
temp/
tmp/
"""

# Write .gitignore file
gitignore_path = project_root / ".gitignore"
with open(gitignore_path, 'w', encoding='utf-8') as f:
    f.write(gitignore_content)

print(f"✅ Created .gitignore at: {gitignore_path}")

# Initialize git repository if git is available
if check_git():
    try:
        # Check if already a git repo
        git_dir = project_root / ".git"
        if git_dir.exists():
            print("✅ Git repository already exists!")
        else:
            subprocess.run(['git', 'init'], cwd=project_root, check=True)
            print("✅ Initialized Git repository!")

        print("\nNext steps for Git:")
        print("1. git add .")
        print("2. git commit -m 'Initial project setup'")
        print("3. git branch -M main  # in case your default branch has another name")
        print("4. git remote add origin <your-repo-url>")
        print("5. git push -u origin main")

    except subprocess.CalledProcessError as e:
        print(f"⚠️  Error with Git: {e}")
else:
    print("⚠️  Git not found. Install Git to enable version control.")
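
The `.gitignore` above keeps `.gitkeep` files in `data/raw/` and `data/external/`, but nothing in the notebook creates them, and Git does not track empty directories. A small sketch of how those placeholder files could be added follows. The helper name `add_gitkeep` is hypothetical, and the demo runs against a temporary directory; in the notebook you would pass `project_root` instead.

```python
import tempfile
from pathlib import Path


def add_gitkeep(root, relative_dirs):
    """Create an empty .gitkeep in each listed directory; return the paths."""
    created = []
    for rel in relative_dirs:
        directory = Path(root) / rel
        directory.mkdir(parents=True, exist_ok=True)
        keep = directory / ".gitkeep"
        keep.touch(exist_ok=True)
        created.append(keep)
    return created


with tempfile.TemporaryDirectory() as tmp:
    kept = add_gitkeep(tmp, ["data/raw", "data/external"])
    names = [p.relative_to(tmp).as_posix() for p in kept]
    print(names)
    # → ['data/raw/.gitkeep', 'data/external/.gitkeep']
```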

## 6. Set Up Jupyter Notebook Configuration

Let's configure Jupyter notebooks for machine learning development: registering a kernel for this environment, reviewing a few suggested settings, and listing useful magic commands.

In [None]:
# Install and configure Jupyter kernel for this environment
def setup_jupyter_kernel():
    try:
        # Install ipykernel if not already installed
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'ipykernel'])
        
        # Add kernel to Jupyter
        kernel_name = f"ml_env_{project_root.name}"
        subprocess.check_call([
            sys.executable, '-m', 'ipykernel', 'install', 
            '--user', '--name', kernel_name, '--display-name', f"Python (ML - {project_root.name})"
        ])
        
        print(f"✅ Added Jupyter kernel: {kernel_name}")
        print("You can now select this kernel in Jupyter notebooks!")

    except subprocess.CalledProcessError as e:
        print(f"⚠️  Error setting up Jupyter kernel: {e}")

# Suggested settings, shown for reference only: nothing in this cell writes
# them to a Jupyter configuration file.
jupyter_config = {
    "NotebookApp": {
        "notebook_dir": str(project_root),
        "open_browser": True,
        "port": 8888
    },
    "InteractiveShell": {
        "ast_node_interactivity": "all"
    }
}

print("🔧 Suggested Jupyter Configuration:")
print("- Notebook directory:", project_root)
print("- Auto-open browser: True")
print("- Default port: 8888")
print("- Show all output: True")

# Set up the kernel (uncomment when ready)
print("\n📋 To set up Jupyter kernel, uncomment and run:")
print("# setup_jupyter_kernel()")

# Useful Jupyter magic commands to remember
print("\n✨ Useful Jupyter Magic Commands:")
magic_commands = [
    "%matplotlib inline  # Display plots inline",
    "%load_ext autoreload  # Auto-reload modules",
    "%autoreload 2  # Reload all modules before executing",
    "%%time  # Time execution of cell",
    "%debug  # Enter debugger after exception",
    "%who  # List all variables",
    "%whos  # List variables with details"
]

for cmd in magic_commands:
    print(f"  {cmd}")

# Uncomment to set up the kernel:
# setup_jupyter_kernel()
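
For context on what registering a kernel does: `ipykernel install` writes a small `kernel.json` file that tells Jupyter which interpreter to launch. The sketch below builds a dictionary of roughly that shape so you can see the moving parts. It is an approximation for illustration, not the exact file ipykernel produces (real files may carry extra fields), and the helper name `kernel_spec` and the display name are hypothetical.

```python
import json
import sys


def kernel_spec(display_name, python_executable=sys.executable):
    """Approximate the contents of a Python kernel's kernel.json."""
    return {
        # Command Jupyter runs to start the kernel; {connection_file}
        # is a placeholder that Jupyter fills in at launch time.
        "argv": [
            python_executable,
            "-m",
            "ipykernel_launcher",
            "-f",
            "{connection_file}",
        ],
        "display_name": display_name,
        "language": "python",
    }


spec = kernel_spec("Python (ML - my_project)")
print(json.dumps(spec, indent=2))
```

The key point is that `argv[0]` is the interpreter of the environment you registered from, which is why the kernel should be installed while the virtual environment is active.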

## 7. Create Data Directories

Proper data organization is crucial for ML projects. Let's set up a structured approach to managing different types of data with clear naming conventions.

In [None]:
# Create data directory structure with documentation
data_structure = {
    'raw': {
        'description': 'Original, immutable data files',
        'guidelines': [
            'Never modify files in this directory',
            'Document data sources and collection methods',
            'Use descriptive filenames with dates if applicable'
        ]
    },
    'processed': {
        'description': 'Cleaned and preprocessed data files',
        'guidelines': [
            'Save intermediate processing steps',
            'Include data validation and quality checks',
            'Document transformations applied'
        ]
    },
    'external': {
        'description': 'External datasets and reference files',
        'guidelines': [
            'Store downloaded datasets from public sources',
            'Include dataset documentation and licenses',
            'Maintain original file formats when possible'
        ]
    }
}

# Create data directories with README files
data_dir = project_root / 'data'
for subdir, info in data_structure.items():
    subdir_path = data_dir / subdir
    subdir_path.mkdir(exist_ok=True)
    
    # Create README for each data subdirectory
    readme_content = f"# {subdir.title()} Data\n\n"
    readme_content += f"{info['description']}\n\n"
    readme_content += "## Guidelines:\n"
    for guideline in info['guidelines']:
        readme_content += f"- {guideline}\n"
    
    if subdir == 'raw':
        readme_content += "\n## Data Sources:\n"
        readme_content += "- Add your data source information here\n"
        readme_content += "- Include URLs, APIs, or database connections\n"
        readme_content += "- Note any access requirements or credentials needed\n"
    elif subdir == 'processed':
        readme_content += "\n## Naming Convention:\n"
        readme_content += "- Use descriptive names: `cleaned_dataset_v1.csv`\n"
        readme_content += "- Include processing date: `processed_2024-01-15.parquet`\n"
        readme_content += "- Version your datasets: `features_v2.pkl`\n"
    elif subdir == 'external':
        readme_content += "\n## Common Sources:\n"
        readme_content += "- Kaggle datasets\n"
        readme_content += "- UCI ML Repository\n"
        readme_content += "- Government open data\n"
        readme_content += "- Academic datasets\n"
    
    readme_path = subdir_path / 'README.md'
    with open(readme_path, 'w', encoding='utf-8') as f:
        f.write(readme_content)

    print(f"✅ Created {subdir} directory with documentation")

print("\n📁 Data directory structure:")
print(f"  {data_dir}/")
for subdir in data_structure.keys():
    print(f"    {subdir}/")
    print("      README.md")
print("\n🎯 Your data is now organized and documented!")
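
The guideline that raw data is never modified is easier to uphold if you can detect changes. One possible approach, sketched below, is to record a SHA-256 checksum per file and compare snapshots later. The function names (`sha256_of`, `snapshot`, `changed_files`) and the sample file are illustrative assumptions, and the demo uses a temporary directory in place of `data/raw/`.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each file's relative path to its checksum."""
    directory = Path(directory)
    return {
        p.relative_to(directory).as_posix(): sha256_of(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """Names that were added, removed, or whose content differs."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "sales.csv").write_text("id,amount\n1,10\n", encoding="utf-8")
    baseline = snapshot(raw)
    # Simulate an accidental edit to a raw file.
    (raw / "sales.csv").write_text("id,amount\n1,99\n", encoding="utf-8")
    modified = changed_files(baseline, snapshot(raw))
    print(modified)
    # → ['sales.csv']
```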

## 8. Set Up Environment Variables

Environment variables help manage sensitive information like API keys, database connections, and configuration settings securely.

In [None]:
# Create .env template file for environment variables
env_template = """# Environment Variables Template
# Copy this file to .env and fill in your actual values
# Never commit .env to version control!

# API Keys
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_API_KEY=your_huggingface_token_here
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_kaggle_api_key

# Database Connections
DATABASE_URL=postgresql://user:password@localhost:5432/dbname
MONGODB_URI=mongodb://localhost:27017/your_database

# AWS Credentials (if using AWS)
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=us-west-2

# MLflow Tracking
MLFLOW_TRACKING_URI=http://localhost:5000
MLFLOW_EXPERIMENT_NAME=default

# Project Settings
PROJECT_NAME=ML_Project
ENVIRONMENT=development
DEBUG=True

# Data Sources
DATA_API_BASE_URL=https://api.example.com/v1
DATA_API_KEY=your_data_api_key

# Model Registry
MODEL_REGISTRY_URI=s3://your-model-bucket/models
"""

# Create .env.template file
env_template_path = project_root / ".env.template"
with open(env_template_path, 'w', encoding='utf-8') as f:
    f.write(env_template)

print(f"✅ Created environment template: {env_template_path}")

# Create a stub .env file (comments only, no real values)
env_path = project_root / ".env"
if not env_path.exists():
    with open(env_path, 'w', encoding='utf-8') as f:
        f.write("# Your actual environment variables\n")
        f.write("# Copy from .env.template and fill in real values\n\n")
    print(f"✅ Created sample .env file: {env_path}")
else:
    print(f"📁 .env file already exists: {env_path}")

# Create Python module for loading environment variables
env_loader_code = '''"""
Environment variable loader for ML projects.
"""
import os
from pathlib import Path
from dotenv import load_dotenv

def load_environment_variables(env_file=None):
    """
    Load environment variables from .env file.
    
    Args:
        env_file (str, optional): Path to .env file. 
                                 If None, looks for .env in project root.
    
    Returns:
        dict: Dictionary of loaded environment variables
    """
    if env_file is None:
        # Find project root (assumes this file is in src/)
        current_dir = Path(__file__).parent
        project_root = current_dir.parent
        env_file = project_root / ".env"
    
    if Path(env_file).exists():
        load_dotenv(env_file)
        print(f"✅ Loaded environment variables from {env_file}")
    else:
        print(f"⚠️  Environment file not found: {env_file}")
    
    # Return commonly used variables
    return {
        'project_name': os.getenv('PROJECT_NAME', 'ML_Project'),
        'environment': os.getenv('ENVIRONMENT', 'development'),
        'debug': os.getenv('DEBUG', 'False').lower() == 'true',
        'mlflow_uri': os.getenv('MLFLOW_TRACKING_URI'),
        'api_keys': {
            'openai': os.getenv('OPENAI_API_KEY'),
            'huggingface': os.getenv('HUGGINGFACE_API_KEY'),
            'kaggle_username': os.getenv('KAGGLE_USERNAME'),
            'kaggle_key': os.getenv('KAGGLE_KEY'),
        }
    }

# Example usage:
# from src.env_config import load_environment_variables
# config = load_environment_variables()
# print(f"Project: {config['project_name']}")
'''

# Create environment configuration module
# (UTF-8 is set explicitly because the module text contains non-ASCII characters)
env_config_path = project_root / "src" / "env_config.py"
with open(env_config_path, 'w', encoding='utf-8') as f:
    f.write(env_loader_code)

print(f"✅ Created environment config module: {env_config_path}")

print("\n🔐 Environment Variables Setup Complete!")
print("\nNext steps:")
print("1. Copy .env.template to .env")
print("2. Fill in your actual API keys and credentials in .env")
print("3. Install python-dotenv: pip install python-dotenv")
print("4. Use: from src.env_config import load_environment_variables")

print("\n⚠️  Security Reminder:")
print("- Never commit .env files to version control")
print("- Use strong, unique API keys")
print("- Rotate keys regularly")
print("- Use different keys for development and production")
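
Since the loader returns `None` for any variable that is not set, a missing key may only surface later as a confusing API error. One way to fail early is to check the required names up front. The sketch below assumes a hypothetical helper `missing_env_vars` and uses a plain dictionary in place of the real environment so that it runs without any `.env` file.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Stand-in for os.environ: one value set, one empty, one absent.
fake_env = {"PROJECT_NAME": "ML_Project", "KAGGLE_KEY": ""}
required = ["PROJECT_NAME", "KAGGLE_KEY", "MLFLOW_TRACKING_URI"]

missing = missing_env_vars(required, fake_env)
print(missing)
# → ['KAGGLE_KEY', 'MLFLOW_TRACKING_URI']
```

In a project you would call it without the second argument, after loading `.env`, and raise an error if the returned list is not empty.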

## 🎉 Setup Complete!

Congratulations! You've successfully set up a comprehensive machine learning workspace. Here's what we've accomplished:

### ✅ What's Been Created:

1. **Project Structure**: Organized folders for data, notebooks, models, and source code
2. **Virtual Environment**: Isolated Python environment for your ML projects
3. **Dependencies**: Comprehensive requirements.txt with essential ML libraries
4. **Version Control**: Git repository with ML-optimized .gitignore
5. **Jupyter Setup**: Configured notebooks with proper kernel integration
6. **Data Organization**: Structured data directories with documentation
7. **Environment Variables**: Secure configuration management
8. **Configuration Modules**: Reusable code in `src/` for loading environment variables

### 🚀 Next Steps:

1. **Install Dependencies**: Run the package installation cells above
2. **Start Coding**: Create your first ML notebook in the `notebooks/` folder
3. **Add Data**: Place datasets in the appropriate `data/` subdirectories
4. **Track Experiments**: Use the `experiments/` folder for model iterations
5. **Build Reusable Code**: Add functions to the `src/` folder
6. **Version Control**: Commit your changes to Git

### 📚 Learning Path Suggestions:

- **Beginner**: Start with data exploration and basic sklearn models
- **Intermediate**: Experiment with feature engineering and model evaluation
- **Advanced**: Implement deep learning models and MLOps practices

### 🛠️ Useful Commands:

```bash
# Activate environment
ml_env\Scripts\activate  # Windows
source ml_env/bin/activate  # macOS/Linux

# Install packages
pip install -r requirements.txt

# Start Jupyter
jupyter lab

# Git commands
git add .
git commit -m "Your message"
git push
```

Happy Machine Learning! 🤖📊

## 7. Create Data Directories with Documentation

Proper data organization is crucial for ML projects. Let's create data directories with clear documentation and naming conventions.

In [None]:
# Create documentation for data directories
data_docs = {
    'raw': """# Raw Data Directory

Store original, immutable datasets here.

## Guidelines:
- Never modify files in this directory
- Document data sources and collection methods
- Use descriptive filenames with dates: `sales_data_2024-01-15.csv`
- Include metadata files: `dataset_description.txt`

## Naming Convention:
- Use lowercase with underscores: `customer_data.csv`
- Include date stamps: `web_logs_2024_01.json`
- Add version numbers if needed: `model_features_v2.parquet`

## Data Sources:
- Add your data source information here
- Include URLs, APIs, or database connections
- Note any access requirements or credentials needed
""",
    
    'processed': """# Processed Data Directory

Store cleaned and preprocessed datasets here.

## Guidelines:
- Save intermediate processing steps
- Include data validation and quality checks
- Document transformations applied
- Keep processing scripts in src/ directory

## Naming Convention:
- Prefix with processing stage: `01_cleaned_data.csv`
- Include transformation type: `normalized_features.pkl`
- Version your datasets: `final_dataset_v3.parquet`

## Processing Pipeline:
1. Raw data → Cleaned data (remove nulls, fix types)
2. Cleaned → Features (feature engineering)
3. Features → Model-ready (scaled, encoded)
""",
    
    'external': """# External Data Directory

Store external datasets and reference files here.

## Guidelines:
- Downloaded datasets from public sources
- API responses and web scraping results
- Reference datasets for benchmarking
- Include licenses and attribution

## Common Sources:
- Kaggle competitions and datasets
- UCI Machine Learning Repository
- Government open data portals
- Academic datasets and papers
- Company APIs and databases

## Documentation:
- Always include source URLs
- Note download dates
- Include any terms of use or licenses
"""
}

# Create data directories with documentation
data_dir = project_root / "data"
for subdir_name, doc_content in data_docs.items():
    subdir = data_dir / subdir_name
    subdir.mkdir(parents=True, exist_ok=True)

    # Create README file for each directory
    # (UTF-8 is set explicitly because the text contains non-ASCII characters)
    readme_path = subdir / "README.md"
    with open(readme_path, 'w', encoding='utf-8') as f:
        f.write(doc_content)

    print(f"✅ Created {subdir_name}/ directory with documentation")

# Create a data manifest template
manifest_content = """# Data Manifest

Track all datasets used in this project.

| Dataset | Source | Date Added | Size | Description | Location |
|---------|--------|------------|------|-------------|----------|
| Example Dataset | Kaggle | 2024-01-15 | 10MB | Customer data | data/raw/customers.csv |

## Data Dictionary

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| customer_id | int | Unique identifier | 12345 |
| age | int | Customer age | 25 |
| city | str | Customer city | New York |

## Data Quality Notes

- Missing values: Handle with mean imputation for numerical, mode for categorical
- Outliers: Found in 'income' column, investigate further
- Duplicates: None found
- Data types: All correct after preprocessing

## Processing Notes

1. `customers_raw.csv` → `customers_cleaned.csv`: Removed nulls, fixed data types
2. `customers_cleaned.csv` → `customers_features.csv`: Added derived features
3. `customers_features.csv` → `customers_final.csv`: Scaled and encoded for modeling
"""

# UTF-8 is set explicitly because the manifest contains non-ASCII characters
manifest_path = data_dir / "DATA_MANIFEST.md"
with open(manifest_path, 'w', encoding='utf-8') as f:
    f.write(manifest_content)

print(f"✅ Created data manifest template: {manifest_path}")
print("\n📊 Data organization complete!")
print("Remember to update the DATA_MANIFEST.md as you add new datasets.")
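
Keeping the manifest up to date by hand is easy to forget, so it may help to generate the file-level columns automatically. The sketch below produces simplified Markdown rows (name, size, location) for every file under a data directory. The helper name `manifest_rows` and the three-column layout are assumptions for illustration; the source, date, and description columns of the template above would still need to be filled in manually.

```python
import tempfile
from pathlib import Path


def manifest_rows(data_dir):
    """Build one Markdown table row per data file, skipping README files."""
    data_dir = Path(data_dir)
    rows = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path.name != "README.md":
            rel = path.relative_to(data_dir).as_posix()
            size = path.stat().st_size
            rows.append(f"| {path.stem} | {size} B | data/{rel} |")
    return rows


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    (raw / "customers.csv").write_bytes(b"customer_id,age\n1,25\n")
    (raw / "README.md").write_text("# Raw Data\n", encoding="utf-8")
    rows = manifest_rows(tmp)
    print("\n".join(rows))
    # → | customers | 21 B | data/raw/customers.csv |
```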

## 8. Set Up Environment Variables

Environment variables help manage sensitive information like API keys, database connections, and configuration settings securely.

In [None]:
# Create .env template file
env_template = """# Environment Variables for ML Project
# Copy this file to .env and fill in your actual values
# NEVER commit .env files to version control!

# Database Configuration
DATABASE_URL=postgresql://username:password@localhost:5432/dbname
DATABASE_HOST=localhost
DATABASE_PORT=5432
DATABASE_NAME=ml_project_db
DATABASE_USER=your_username
DATABASE_PASSWORD=your_password

# API Keys
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_API_KEY=your_huggingface_key_here
WANDB_API_KEY=your_wandb_key_here

# Cloud Storage (AWS)
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=us-west-2
S3_BUCKET_NAME=your_ml_data_bucket

# Cloud Storage (GCP)
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account-key.json
GCS_BUCKET_NAME=your_gcs_bucket

# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:5000
MLFLOW_EXPERIMENT_NAME=default

# Model Configuration
MODEL_VERSION=v1.0.0
MODEL_PATH=models/production/
RANDOM_SEED=42

# Development Settings
DEBUG=True
LOG_LEVEL=INFO
"""

# Create .env.template file
env_template_path = project_root / ".env.template"
with open(env_template_path, 'w', encoding='utf-8') as f:
    f.write(env_template)

print(f"✅ Created environment template: {env_template_path}")

# Create config.py for loading environment variables
config_content = '''"""
Configuration module for loading environment variables.
"""

import os
from pathlib import Path
from dotenv import load_dotenv

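# Load environment variables from the .env file in the project root.
# This module lives in src/, so the project root is one level up.
env_path = Path(__file__).resolve().parent.parent / ".env"
load_dotenv(env_path)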

class Config:
    """Configuration class for ML project."""
    
    # Database settings
    DATABASE_URL = os.getenv("DATABASE_URL")
    DATABASE_HOST = os.getenv("DATABASE_HOST", "localhost")
    DATABASE_PORT = int(os.getenv("DATABASE_PORT", 5432))
    DATABASE_NAME = os.getenv("DATABASE_NAME")
    DATABASE_USER = os.getenv("DATABASE_USER")
    DATABASE_PASSWORD = os.getenv("DATABASE_PASSWORD")
    
    # API Keys
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")
    WANDB_API_KEY = os.getenv("WANDB_API_KEY")
    
    # AWS Configuration
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    AWS_REGION = os.getenv("AWS_REGION", "us-west-2")
    S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME")
    
    # GCP Configuration
    GOOGLE_APPLICATION_CREDENTIALS = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
    GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME")
    
    # MLflow Configuration
    MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")
    MLFLOW_EXPERIMENT_NAME = os.getenv("MLFLOW_EXPERIMENT_NAME", "default")
    
    # Model Configuration
    MODEL_VERSION = os.getenv("MODEL_VERSION", "v1.0.0")
    MODEL_PATH = os.getenv("MODEL_PATH", "models/production/")
    RANDOM_SEED = int(os.getenv("RANDOM_SEED", 42))
    
    # Development Settings
    DEBUG = os.getenv("DEBUG", "False").lower() == "true"
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

# Example usage (run from the project root):
# from src.config import Config
# print(Config.DATABASE_URL)
'''

config_path = project_root / "src" / "config.py"
with open(config_path, 'w', encoding='utf-8') as f:
    f.write(config_content)

print(f"✅ Created configuration module: {config_path}")

# Instructions for using environment variables
print("\n🔐 Environment Variables Setup Complete!")
print("\nNext steps:")
print("1. Copy .env.template to .env: cp .env.template .env")
print("2. Edit .env file with your actual values")
print("3. Install python-dotenv: pip install python-dotenv")
print("4. Import config in your code: from src.config import Config")
print("\n⚠️  IMPORTANT: Never commit .env files to version control!")
print("The .env file is already in .gitignore for your protection.")
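
One fragile spot in the `Config` class is type conversion: `int(os.getenv("DATABASE_PORT", 5432))` raises `ValueError` if the variable is present but empty, and only the exact string `true` enables `DEBUG`. The helpers below sketch a more forgiving approach. Their names (`env_bool`, `env_int`) and the accepted spellings are assumptions made for illustration, and the demo reads from a plain dictionary rather than the real environment.

```python
import os

# Spellings treated as boolean values (an illustrative choice).
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_bool(name, default=False, environ=None):
    """Read a boolean variable, falling back to default if unset or unclear."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip().lower()
    if raw in _TRUE:
        return True
    if raw in _FALSE:
        return False
    return default


def env_int(name, default, environ=None):
    """Read an integer variable, falling back to default if unset or invalid."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    try:
        return int(raw)
    except ValueError:
        return default


fake_env = {"DEBUG": "True", "DATABASE_PORT": "", "RANDOM_SEED": "7"}
debug = env_bool("DEBUG", environ=fake_env)
port = env_int("DATABASE_PORT", 5432, environ=fake_env)
seed = env_int("RANDOM_SEED", 42, environ=fake_env)
print(debug, port, seed)
# → True 5432 7
```

Silently falling back to a default is a design choice. For settings where a typo should stop the program, raising an error instead of returning the default may be the safer option.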