### What Are Domain-Specific Features?
Domain-specific features are variables or attributes engineered based on expert knowledge or insights about the problem’s domain. Unlike generic features (e.g., raw data like height, weight, or pixel values), these are tailored to capture patterns, relationships, or nuances that are particularly relevant to the task at hand. The idea is to transform raw data into something more meaningful for a machine learning model, often improving its ability to generalize and make accurate predictions.

For example:
- In a medical dataset, raw data might include "blood pressure" and "heart rate." A domain-specific feature could be "cardiovascular risk score," derived from combining these using a formula informed by medical expertise.
- In a text classification task, instead of just word counts, you might add a feature like "sentiment score" based on linguistic rules.
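
The medical example above can be sketched in a few lines of pandas. This is an illustration only, not a validated risk score: the column names and sample values are hypothetical, and the two derived quantities (pulse pressure and rate-pressure product) stand in for whatever formula a clinician would actually recommend.

```python
import pandas as pd

# Hypothetical vitals table; column names and values are made up for illustration.
vitals = pd.DataFrame({
    "systolic_bp": [120, 145, 110],   # mmHg
    "diastolic_bp": [80, 95, 70],     # mmHg
    "heart_rate": [70, 88, 60],       # beats per minute
})

def add_cardio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two derived cardiovascular features."""
    out = df.copy()
    # Pulse pressure: systolic minus diastolic pressure.
    out["pulse_pressure"] = out["systolic_bp"] - out["diastolic_bp"]
    # Rate-pressure product: heart rate times systolic pressure.
    out["rate_pressure_product"] = out["heart_rate"] * out["systolic_bp"]
    return out

features = add_cardio_features(vitals)
print(features)
```

The raw columns stay in place, so a model can still use them directly; the derived columns simply make a relationship explicit that the model would otherwise have to discover.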

By embedding domain knowledge, you give the model a head start: instead of searching blindly for patterns, it works with data that is already structured to highlight what matters.

### How It Impacts Model Accuracy
1. **Before (Without Domain-Specific Features):**
   - The model relies solely on raw or basic features.
   - It might miss subtle but critical patterns unless the dataset is massive and the model is complex enough to learn them implicitly.
   - Accuracy can suffer due to noise, irrelevant features, or insufficient signal in the raw data.

2. **After (With Domain-Specific Features):**
   - The model leverages features that emphasize relevant relationships or reduce noise.
   - It often learns faster and performs better, especially with smaller datasets or simpler models.
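   - Accuracy tends to improve when the engineered features expose structure the model cannot easily learn from the raw inputs; the gain can be negligible when the raw features already carry that signal.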
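
The point about simpler models can be illustrated with a small sketch. It assumes a hypothetical setting that is not part of the housing example in this section: the target depends on the product of two raw inputs (lot length and width), which a plain linear model cannot represent unless the product is supplied as a feature. All names and numbers here are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Hypothetical lot dimensions in feet.
length = rng.uniform(20, 100, n)
width = rng.uniform(20, 100, n)

# Assumed ground truth: price is driven by lot area, plus Gaussian noise.
price = 50 * length * width + rng.normal(0, 10_000, n)

X_raw = np.column_stack([length, width])                  # raw features only
X_eng = np.column_stack([length, width, length * width])  # plus engineered area

def holdout_rmse(X, y):
    """Fit a linear model on 80% of the rows and return RMSE on the other 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

rmse_raw = holdout_rmse(X_raw, price)
rmse_eng = holdout_rmse(X_eng, price)
print(f"Linear model, raw features:       RMSE = {rmse_raw:,.0f}")
print(f"Linear model, with area feature:  RMSE = {rmse_eng:,.0f}")
```

In this setup the area feature should lower the error, because it hands the linear model exactly the interaction it cannot build on its own. A more flexible model, or a target that is already additive in the raw inputs, would leave much less room for such a gain.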

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Simulate a synthetic housing dataset
n_samples = 1000
data = {
    'square_footage': np.random.uniform(800, 4000, n_samples),  # House size in square feet
    'bedrooms': np.random.randint(1, 6, n_samples),             # Number of bedrooms
    'bathrooms': np.random.randint(1, 4, n_samples),           # Number of bathrooms
    'distance_to_city': np.random.uniform(1, 20, n_samples),   # Distance to city center (miles)
    'year_built': np.random.randint(1950, 2020, n_samples),    # Year the house was built
}

# Create a synthetic target variable (house price)
# Price depends on square footage, bedrooms, bathrooms, distance to city, and age
df = pd.DataFrame(data)
df['price'] = (
    200 * df['square_footage'] +
    20000 * df['bedrooms'] +
    30000 * df['bathrooms'] -
    5000 * df['distance_to_city'] -
    1000 * (2023 - df['year_built']) +  # Older houses are cheaper
    np.random.normal(0, 50000, n_samples)  # Add some noise
)

# Features (X) and target (y)
X = df.drop('price', axis=1).values
y = df['price'].values

# Before: Using only raw features
print("Before: Using raw features only")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate model with raw features
rf_before = RandomForestRegressor(random_state=42)
rf_before.fit(X_train, y_train)
y_pred_before = rf_before.predict(X_test)
mse_before = mean_squared_error(y_test, y_pred_before)
rmse_before = np.sqrt(mse_before)
print(f"RMSE with raw features: {rmse_before:.2f}\n")

# After: Adding domain-specific features
print("After: Adding domain-specific features")

# Create new features based on domain knowledge
def create_domain_features(X):
    df = pd.DataFrame(X, columns=['square_footage', 'bedrooms', 'bathrooms', 'distance_to_city', 'year_built'])
    
    # Domain-specific features
    df['price_per_sqft'] = 200  # Constant (the coefficient used to generate price); identical for every row, so it adds no signal
    df['bedroom_to_bathroom_ratio'] = df['bedrooms'] / df['bathrooms']  # bathrooms >= 1, so no division by zero
    df['house_age'] = 2023 - df['year_built']  # Age relative to 2023, the reference year used when generating price
    df['space_per_room'] = df['square_footage'] / (df['bedrooms'] + df['bathrooms'])
    df['proximity_value'] = 1 / (df['distance_to_city'] + 1)  # Inverse distance (closer = more valuable)
    
    return df.values

# Create new feature matrix
X_enhanced = create_domain_features(X)
# Same test_size and random_state as before, so the same rows land in train and test
X_train_enh, X_test_enh, y_train, y_test = train_test_split(X_enhanced, y, test_size=0.2, random_state=42)

# Train and evaluate model with enhanced features
rf_after = RandomForestRegressor(random_state=42)
rf_after.fit(X_train_enh, y_train)
y_pred_after = rf_after.predict(X_test_enh)
mse_after = mean_squared_error(y_test, y_pred_after)
rmse_after = np.sqrt(mse_after)
print(f"RMSE with enhanced features: {rmse_after:.2f}")

# Print feature importance for the enhanced model
feature_names = ['square_footage', 'bedrooms', 'bathrooms', 'distance_to_city', 'year_built',
                 'price_per_sqft', 'bedroom_to_bathroom_ratio', 'house_age', 'space_per_room', 'proximity_value']
print("\nFeature Importances:")
for name, importance in zip(feature_names, rf_after.feature_importances_):
    print(f"{name}: {importance:.3f}")

Output:

Before: Using raw features only
RMSE with raw features: 52959.51

After: Adding domain-specific features
RMSE with enhanced features: 52918.23

Feature Importances:
square_footage: 0.905
bedrooms: 0.005
bathrooms: 0.005
distance_to_city: 0.015
year_built: 0.010
price_per_sqft: 0.000
bedroom_to_bathroom_ratio: 0.005
house_age: 0.009
space_per_room: 0.030
proximity_value: 0.016
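
Both RMSE values above sit close to 50,000, which is the standard deviation of the noise added when the price was generated. No model can be expected to beat that noise level on held-out data, so there is little room left for the engineered features to help. Note also that `price_per_sqft` is a constant (importance 0.000), and that `house_age` and `proximity_value` are monotonic transforms of single raw columns, which give a tree-based model the same possible splits as the originals. The sketch below estimates that noise floor by scoring the exact noise-free formula against the noisy price. It regenerates data with the same ranges and coefficients, but uses `np.random.default_rng`, so the rows differ from the run above.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000
noise_std = 50_000  # standard deviation of the noise term in the price formula

# Same value ranges as the synthetic housing dataset (different random draws).
square_footage = rng.uniform(800, 4000, n_samples)
bedrooms = rng.integers(1, 6, n_samples)
bathrooms = rng.integers(1, 4, n_samples)
distance_to_city = rng.uniform(1, 20, n_samples)
year_built = rng.integers(1950, 2020, n_samples)

# Noise-free part of the price: what a perfect model would predict.
signal = (
    200 * square_footage
    + 20000 * bedrooms
    + 30000 * bathrooms
    - 5000 * distance_to_city
    - 1000 * (2023 - year_built)
)
price = signal + rng.normal(0, noise_std, n_samples)

# RMSE of the perfect predictor is an estimate of the irreducible error.
oracle_rmse = float(np.sqrt(np.mean((price - signal) ** 2)))
print(f"RMSE of the noise-free formula: {oracle_rmse:,.0f}")
```

If this value is close to the RMSE of a fitted model, the model is already near the limit set by the noise, and extra features can only produce small differences of the kind seen above.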
