# 🔧 Feature Engineering Pipeline

This notebook builds engineered features and a preprocessing pipeline to:
1. Create interaction, polynomial, and other derived features
2. Encode categorical features (ordinal and one-hot encoding)
3. Scale numerical features
4. Prepare data for model training

## Table of Contents
1. [Setup & Data Loading](#1-setup--data-loading)
2. [Feature Definitions](#2-feature-definitions)
3. [Preprocessing Pipeline](#3-preprocessing-pipeline)
4. [Apply Pipeline & Save Processed Data](#4-apply-pipeline--save-processed-data)

## 1. Setup & Data Loading

In [57]:
# Import libraries
import pandas as pd
import numpy as np
from pathlib import Path
import joblib

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    OrdinalEncoder,
    FunctionTransformer
)
from sklearn.model_selection import train_test_split

# Paths
DATA_DIR = Path('../data/playground-series-s6e1')
OUTPUT_DIR = Path('../data/processed')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("✅ Libraries loaded successfully!")

✅ Libraries loaded successfully!


In [58]:
# Load datasets
train = pd.read_csv(DATA_DIR / 'train.csv')
test = pd.read_csv(DATA_DIR / 'test.csv')

print(f"Training set shape: {train.shape}")
print(f"Test set shape: {test.shape}")
print(f"\nColumns: {train.columns.tolist()}")

Training set shape: (630000, 13)
Test set shape: (270000, 12)

Columns: ['id', 'age', 'gender', 'course', 'study_hours', 'class_attendance', 'internet_access', 'sleep_hours', 'sleep_quality', 'study_method', 'facility_rating', 'exam_difficulty', 'exam_score']


## 2. Feature Definitions

In [59]:
# Define feature groups
TARGET = 'exam_score'
ID_COL = 'id'

# Numerical features (continuous)
NUMERICAL_FEATURES = ['age', 'study_hours', 'class_attendance', 'sleep_hours']

# Ordinal categorical features (have natural ordering)
ORDINAL_FEATURES = {
    'sleep_quality': ['poor', 'average', 'good'],
    'facility_rating': ['low', 'medium', 'high'],
    'exam_difficulty': ['easy', 'moderate', 'hard']
}

# Nominal categorical features (no natural ordering) - use one-hot encoding
NOMINAL_FEATURES = ['gender', 'course', 'internet_access', 'study_method']

print("Feature Groups:")
print(f"  Numerical ({len(NUMERICAL_FEATURES)}): {NUMERICAL_FEATURES}")
print(f"  Ordinal ({len(ORDINAL_FEATURES)}): {list(ORDINAL_FEATURES.keys())}")
print(f"  Nominal ({len(NOMINAL_FEATURES)}): {NOMINAL_FEATURES}")

Feature Groups:
  Numerical (4): ['age', 'study_hours', 'class_attendance', 'sleep_hours']
  Ordinal (3): ['sleep_quality', 'facility_rating', 'exam_difficulty']
  Nominal (4): ['gender', 'course', 'internet_access', 'study_method']


In [60]:
# Prepare features and target
X_train = train.drop(columns=[ID_COL, TARGET])
y_train = train[TARGET]
X_test = test.drop(columns=[ID_COL])
test_ids = test[ID_COL]

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")

X_train shape: (630000, 11)
y_train shape: (630000,)
X_test shape: (270000, 11)


### Feature Interactions: Sleep Features

Creating interaction features between `sleep_hours` and `sleep_quality` to capture the combined effect of sleep duration and quality on exam performance.
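
As a small worked example of the weighting used in the next cell, the sketch below applies the same arithmetic to three rows. The toy frame is hypothetical and not taken from the competition data; it only shows that the quality rank (0, 1, 2) is shifted by one, so poor sleep still contributes its raw hours.

```python
import pandas as pd

# Hypothetical rows, not taken from the competition data
toy = pd.DataFrame({
    "sleep_hours": [5.0, 8.0, 8.0],
    "sleep_quality": ["poor", "average", "good"],
})

quality_rank = toy["sleep_quality"].map({"poor": 0, "average": 1, "good": 2})

# Multiplier is rank + 1, so the three quality levels weight hours by 1, 2 and 3
toy["sleep_score"] = toy["sleep_hours"] * (quality_rank + 1)

print(toy)
```

With these inputs the scores are 5.0, 16.0 and 24.0: the same 8 hours count 1.5 times as much with good quality as with average quality.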

In [61]:
# Create sleep interaction features
def create_sleep_features(df):
    """
    Create interaction features between sleep_hours and sleep_quality.
    """
    df = df.copy()
    
    # Map sleep_quality to numeric (for interactions)
    sleep_quality_map = {'poor': 0, 'average': 1, 'good': 2}
    sleep_quality_numeric = df['sleep_quality'].map(sleep_quality_map)
    
    # 1. Sleep Score: sleep_hours * (sleep_quality rank + 1), a weighted interaction
    #    Higher hours + better quality = higher score; 'poor' keeps the raw hours
    df['sleep_score'] = df['sleep_hours'] * (sleep_quality_numeric + 1)
    
    # 2. Optimal Sleep Flag: 7-9 hours with good/average quality
    df['optimal_sleep'] = (
        (df['sleep_hours'] >= 7) & 
        (df['sleep_hours'] <= 9) & 
        (df['sleep_quality'].isin(['good', 'average']))
    ).astype(int)
    
    # 3. Sleep Deficit: Poor quality or < 6 hours
    df['sleep_deficit'] = (
        (df['sleep_hours'] < 6) | 
        (df['sleep_quality'] == 'poor')
    ).astype(int)
    
    # 4. Effective Study Score: study_hours weighted by the same quality multiplier
    df['effective_study_score'] = df['study_hours'] * (sleep_quality_numeric + 1)
    
    return df

# Apply to train and test
X_train = create_sleep_features(X_train)
X_test = create_sleep_features(X_test)

# Add new features to numerical list for preprocessing
SLEEP_INTERACTION_FEATURES = ['sleep_score', 'optimal_sleep', 'sleep_deficit', 'effective_study_score']
NUMERICAL_FEATURES = NUMERICAL_FEATURES + SLEEP_INTERACTION_FEATURES

print("✅ Sleep interaction features created!")
print(f"\nNew features: {SLEEP_INTERACTION_FEATURES}")
print(f"\nUpdated numerical features ({len(NUMERICAL_FEATURES)}): {NUMERICAL_FEATURES}")

✅ Sleep interaction features created!

New features: ['sleep_score', 'optimal_sleep', 'sleep_deficit', 'effective_study_score']

Updated numerical features (8): ['age', 'study_hours', 'class_attendance', 'sleep_hours', 'sleep_score', 'optimal_sleep', 'sleep_deficit', 'effective_study_score']


In [62]:
# Verify the new features
print("Sleep Interaction Features Summary:")
print("=" * 50)
print(X_train[SLEEP_INTERACTION_FEATURES].describe().round(2))

print("\n\nSample of sleep features by sleep_quality:")
print(X_train.groupby('sleep_quality')[SLEEP_INTERACTION_FEATURES].mean().round(2))

Sleep Interaction Features Summary:
       sleep_score  optimal_sleep  sleep_deficit  effective_study_score
count    630000.00      630000.00      630000.00              630000.00
mean         14.18           0.23           0.54                   8.10
std           7.01           0.42           0.50                   6.21
min           4.10           0.00           0.00                   0.08
25%           8.30           0.00           0.00                   3.06
50%          13.20           0.00           1.00                   6.55
75%          18.80           0.00           1.00                  12.38
max          29.70           1.00           1.00                  23.73


Sample of sleep features by sleep_quality:
               sleep_score  optimal_sleep  sleep_deficit  \
sleep_quality                                              
average              14.17           0.35           0.31   
good                 21.37           0.36           0.30   
poor                  7.01     

### Feature Interactions: In-School Features

In [63]:
def create_school_features(df):
    """
    Create an interaction feature between class_attendance and facility_rating.
    """
    df = df.copy()
    
    # Map facility_rating to numeric (for interactions)
    facility_rating_map = {'low': 1, 'medium': 2, 'high': 3}
    facility_rating_numeric = df['facility_rating'].map(facility_rating_map)
    
    # 1. Class Time Quality: class_attendance * facility_rating (weighted interaction)
    #    Higher attendance + better facilities = higher score
    df['class_time_quality'] = df['class_attendance'] * facility_rating_numeric
    
    return df

# Apply to train and test
X_train = create_school_features(X_train)
X_test = create_school_features(X_test)

# Add new features to numerical list for preprocessing
SCHOOL_INTERACTION_FEATURES = ['class_time_quality']
NUMERICAL_FEATURES = NUMERICAL_FEATURES + SCHOOL_INTERACTION_FEATURES

print("✅ School interaction features created!")
print(f"\nNew features: {SCHOOL_INTERACTION_FEATURES}")
print(f"\nUpdated numerical features ({len(NUMERICAL_FEATURES)}): {NUMERICAL_FEATURES}")

✅ School interaction features created!

New features: ['class_time_quality']

Updated numerical features (9): ['age', 'study_hours', 'class_attendance', 'sleep_hours', 'sleep_score', 'optimal_sleep', 'sleep_deficit', 'effective_study_score', 'class_time_quality']


In [64]:
# Verify the new features
print("School Interaction Features Summary:")
print("=" * 50)
print(X_train[SCHOOL_INTERACTION_FEATURES].describe().round(2))

School Interaction Features Summary:
       class_time_quality
count           630000.00
mean               143.39
std                 70.25
min                 40.60
25%                 84.20
50%                136.00
75%                190.80
max                298.20


### Polynomial Features: Squared Terms

Creating squared (quadratic) features to capture non-linear relationships between continuous variables and exam scores.
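
To illustrate why a squared term can help a linear model, the sketch below fits ordinary least squares with and without `x**2` on synthetic data that has diminishing returns. The data-generating curve and all variable names are assumptions made for illustration, not properties of the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "study hours" with a concave (diminishing-returns) effect on score
x = rng.uniform(0, 8, size=500)
y = 40 + 12 * x - 0.9 * x**2 + rng.normal(0, 2, size=500)

def fit_mse(design, target):
    """Least-squares fit; returns mean squared error on the same data."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(np.mean(residuals**2))

ones = np.ones_like(x)
mse_linear = fit_mse(np.column_stack([ones, x]), y)
mse_squared = fit_mse(np.column_stack([ones, x, x**2]), y)

print(f"MSE with x only:     {mse_linear:.2f}")
print(f"MSE with x and x^2:  {mse_squared:.2f}")
```

Tree-based models can usually learn this kind of curvature on their own, so squared terms tend to matter most for linear models.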

In [65]:
def create_polynomial_features(df):
    """
    Create squared features for key continuous variables with high correlation to target.
    """
    df = df.copy()
    
    # Features to square (chosen for high correlation or meaningful non-linearity)
    features_to_square = [
        'sleep_hours',          # Non-linear relationship with performance
        'study_hours',          # Diminishing returns after certain point
        'class_attendance',     # High attendance squared amplifies importance
        'sleep_score',          # Interaction feature
        'effective_study_score', # Interaction feature
        'class_time_quality'    # Interaction feature
    ]
    
    # Create squared features
    for feature in features_to_square:
        df[f'{feature}_squared'] = df[feature] ** 2
    
    return df

# Apply to train and test
X_train = create_polynomial_features(X_train)
X_test = create_polynomial_features(X_test)

# Add new features to numerical list for preprocessing
POLYNOMIAL_FEATURES = [
    'sleep_hours_squared',
    'study_hours_squared',
    'class_attendance_squared',
    'sleep_score_squared',
    'effective_study_score_squared',
    'class_time_quality_squared'
]
NUMERICAL_FEATURES = NUMERICAL_FEATURES + POLYNOMIAL_FEATURES

print("✅ Polynomial features created!")
print(f"\nNew squared features: {POLYNOMIAL_FEATURES}")
print(f"\nTotal numerical features: {len(NUMERICAL_FEATURES)}")

✅ Polynomial features created!

New squared features: ['sleep_hours_squared', 'study_hours_squared', 'class_attendance_squared', 'sleep_score_squared', 'effective_study_score_squared', 'class_time_quality_squared']

Total numerical features: 15


In [66]:
# Verify the polynomial features
print("Polynomial Features Summary:")
print("=" * 50)
print(X_train[POLYNOMIAL_FEATURES].describe().round(2))

print("\n\nComparison of original vs squared features:")
comparison_cols = ['study_hours', 'study_hours_squared', 'sleep_hours', 'sleep_hours_squared']
print(X_train[comparison_cols].head(10).round(2))

Polynomial Features Summary:
       sleep_hours_squared  study_hours_squared  class_attendance_squared  \
count            630000.00            630000.00                 630000.00   
mean                 53.07                21.59                   5485.97   
std                  24.71                19.57                   2495.25   
min                  16.81                 0.01                   1648.36   
25%                  31.36                 3.88                   3249.00   
50%                  50.41                16.00                   5270.76   
75%                  73.96                36.60                   7603.84   
max                  98.01                62.57                   9880.36   

       sleep_score_squared  effective_study_score_squared  \
count            630000.00                      630000.00   
mean                250.05                         104.15   
std                 229.02                         136.82   
min                  16.81       

### Other Features (Copied From an External Source)

In [67]:
def create_other_features(df):
    df_temp = df.copy()
    eps = 1e-5
    
    # Log transforms
    sh_pos = df_temp['study_hours'].clip(lower=0)
    ca_pos = df_temp['class_attendance'].clip(lower=0)
    sl_pos = df_temp['sleep_hours'].clip(lower=0)

    df_temp['log_study_hours'] = np.log1p(sh_pos)
    df_temp['log_class_attendance'] = np.log1p(ca_pos)
    df_temp['log_sleep_hours'] = np.log1p(sl_pos)

    # Sqrt transforms
    df_temp['sqrt_study_hours'] = np.sqrt(sh_pos)
    df_temp['sqrt_class_attendance'] = np.sqrt(ca_pos)

    # Key interactions
    df_temp['study_hours_times_attendance'] = df_temp['study_hours'] * df_temp['class_attendance']
    df_temp['attendance_times_sleep'] = df_temp['class_attendance'] * df_temp['sleep_hours']
    df_temp['age_times_study_hours'] = df_temp['age'] * df_temp['study_hours']

    # Important ratios
    df_temp['study_hours_over_sleep'] = df_temp['study_hours'] / (df_temp['sleep_hours'] + eps)
    df_temp['attendance_over_sleep'] = df_temp['class_attendance'] / (df_temp['sleep_hours'] + eps)
    df_temp['attendance_over_study'] = df_temp['class_attendance'] / (df_temp['study_hours'] + eps)

    # Ordinal encoding
    sleep_quality_map = {'poor': 0, 'average': 1, 'good': 2}
    facility_rating_map = {'low': 0, 'medium': 1, 'high': 2}
    exam_difficulty_map = {'easy': 0, 'moderate': 1, 'hard': 2}

    df_temp['sleep_quality_numeric'] = df_temp['sleep_quality'].map(sleep_quality_map).fillna(1).astype(int)
    df_temp['facility_rating_numeric'] = df_temp['facility_rating'].map(facility_rating_map).fillna(1).astype(int)
    df_temp['exam_difficulty_numeric'] = df_temp['exam_difficulty'].map(exam_difficulty_map).fillna(1).astype(int)

    # Ordinal × numeric interactions
    df_temp['study_hours_times_sleep_quality'] = df_temp['study_hours'] * df_temp['sleep_quality_numeric']
    df_temp['attendance_times_facility'] = df_temp['class_attendance'] * df_temp['facility_rating_numeric']
    df_temp['sleep_hours_times_difficulty'] = df_temp['sleep_hours'] * df_temp['exam_difficulty_numeric']

    # Ordinal × ordinal interactions
    df_temp['facility_x_sleepq'] = df_temp['facility_rating_numeric'] * df_temp['sleep_quality_numeric']
    df_temp['difficulty_x_facility'] = df_temp['exam_difficulty_numeric'] * df_temp['facility_rating_numeric']

    # Rule-based flags
    df_temp["high_att_high_study"] = ((df_temp["class_attendance"] >= 90) & (df_temp["study_hours"] >= 6)).astype(int)
    df_temp["ideal_sleep_flag"] = ((df_temp["sleep_hours"] >= 7) & (df_temp["sleep_hours"] <= 9)).astype(int)
    df_temp["high_study_flag"] = (df_temp["study_hours"] >= 7).astype(int)

    # Composite efficiency
    df_temp['efficiency'] = (df_temp['study_hours'] * df_temp['class_attendance']) / (df_temp['sleep_hours'] + 1)
    
    # Binned features
    df_temp["age_bin_num"] = pd.cut(df_temp["age"], bins=[0,17,19,21,23,100], labels=[0,1,2,3,4]).astype(float)
    df_temp["study_bin_num"] = pd.cut(df_temp["study_hours"], bins=[-1,2,4,6,8,100], labels=[0,1,2,3,4]).astype(float)
    df_temp["sleep_bin_num"] = pd.cut(df_temp["sleep_hours"], bins=[-1,5,6,7,8,100], labels=[0,1,2,3,4]).astype(float)
    df_temp["attendance_bin_num"] = pd.cut(df_temp["class_attendance"], bins=[-1,60,75,85,95,101], labels=[0,1,2,3,4]).astype(float)

    # Gap features
    df_temp['sleep_gap_8'] = (df_temp['sleep_hours'] - 8.0).abs()
    
    return df_temp

# Apply to train and test
X_train = create_other_features(X_train)
X_test = create_other_features(X_test)

# Define all other features created
OTHER_FEATURES = [
    # Sqrt transforms
    'sqrt_study_hours', #'sqrt_class_attendance',
    # Key interactions
    'study_hours_times_attendance', 'attendance_times_sleep', #'age_times_study_hours',
    # Important ratios
    #'study_hours_over_sleep', #'attendance_over_sleep', 'attendance_over_study',
    # Ordinal numeric
    #'sleep_quality_numeric', 'facility_rating_numeric', 'exam_difficulty_numeric',
    # Ordinal × numeric interactions
    'study_hours_times_sleep_quality', #'attendance_times_facility', #'sleep_hours_times_difficulty',
    # Ordinal × ordinal interactions
    'facility_x_sleepq', #'difficulty_x_facility',
    # Rule-based flags
    'high_att_high_study', 'ideal_sleep_flag', 'high_study_flag',
    # Composite efficiency
    'efficiency',
    # Binned features
    'study_bin_num', 'sleep_bin_num', 'attendance_bin_num',
    # Gap features
    'sleep_gap_8'
]

# Add to numerical features
NUMERICAL_FEATURES = NUMERICAL_FEATURES + OTHER_FEATURES

print("✅ Other features created!")
print(f"\nNew features ({len(OTHER_FEATURES)}): {OTHER_FEATURES}")
print(f"\nTotal numerical features: {len(NUMERICAL_FEATURES)}")

✅ Other features created!

New features (13): ['sqrt_study_hours', 'study_hours_times_attendance', 'attendance_times_sleep', 'study_hours_times_sleep_quality', 'facility_x_sleepq', 'high_att_high_study', 'ideal_sleep_flag', 'high_study_flag', 'efficiency', 'study_bin_num', 'sleep_bin_num', 'attendance_bin_num', 'sleep_gap_8']

Total numerical features: 28

## 3. Preprocessing Pipeline


In [68]:
# Create preprocessing pipelines for each feature type

# 1. Numerical pipeline: StandardScaler
numerical_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# 2. Ordinal pipeline: OrdinalEncoder with defined categories
ordinal_pipeline = Pipeline([
    ('ordinal', OrdinalEncoder(
        categories=[ORDINAL_FEATURES[col] for col in ORDINAL_FEATURES.keys()],
        handle_unknown='use_encoded_value',
        unknown_value=-1
    ))
])

# 3. Nominal pipeline: OneHotEncoder
nominal_pipeline = Pipeline([
    ('onehot', OneHotEncoder(
        sparse_output=False,
        handle_unknown='ignore',
        drop='if_binary'  # Drop one column for binary features like internet_access
    ))
])

print("✅ Individual pipelines created")

✅ Individual pipelines created


In [69]:
# Combine all pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_pipeline, NUMERICAL_FEATURES),
        ('ordinal', ordinal_pipeline, list(ORDINAL_FEATURES.keys())),
        ('nominal', nominal_pipeline, NOMINAL_FEATURES)
    ],
    remainder='drop',  # Drop any columns not specified
    verbose_feature_names_out=True
)

print("✅ ColumnTransformer created")
print(f"\nPipeline structure:")
print(preprocessor)

✅ ColumnTransformer created

Pipeline structure:
ColumnTransformer(transformers=[('numerical',
                                 Pipeline(steps=[('scaler', StandardScaler())]),
                                 ['age', 'study_hours', 'class_attendance',
                                  'sleep_hours', 'sleep_score', 'optimal_sleep',
                                  'sleep_deficit', 'effective_study_score',
                                  'class_time_quality', 'sleep_hours_squared',
                                  'study_hours_squared',
                                  'class_attendance_squared',
                                  'sleep_score_squared',
                                  'effective_study_scor...
                                                                              'average',
                                                                              'good'],
                                                                             ['low',
              

## 4. Apply Pipeline & Save Processed Data

In [70]:
# Fit the preprocessor on training data and transform both train and test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Get feature names after transformation
feature_names = preprocessor.get_feature_names_out()

print(f"✅ Preprocessing complete!")
print(f"\nOriginal shapes:")
print(f"  X_train: {X_train.shape}")
print(f"  X_test: {X_test.shape}")
print(f"\nProcessed shapes:")
print(f"  X_train_processed: {X_train_processed.shape}")
print(f"  X_test_processed: {X_test_processed.shape}")
print(f"\nNumber of features after encoding: {len(feature_names)}")

✅ Preprocessing complete!

Original shapes:
  X_train: (630000, 50)
  X_test: (270000, 50)

Processed shapes:
  X_train_processed: (630000, 47)
  X_test_processed: (270000, 47)

Number of features after encoding: 47
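
The drop from 50 to 47 columns follows from `remainder='drop'`: the 28 listed numerical columns and the 3 ordinal columns account for 31 outputs, so the 4 nominal features expand to the remaining 16 one-hot columns, and engineered columns left out of `NUMERICAL_FEATURES` are discarded. One way to confirm such a breakdown is to tally output names by their transformer prefix. The sketch below uses a short hypothetical list of names rather than the real `feature_names` array.

```python
from collections import Counter

# Hypothetical subset of names in the "<transformer>__<column>" format
feature_names = [
    "numerical__age",
    "numerical__study_hours",
    "ordinal__sleep_quality",
    "nominal__internet_access_yes",
    "nominal__study_method_mixed",
    "nominal__study_method_self-study",
]

# Split on the first double underscore to recover the transformer name
counts = Counter(name.split("__", 1)[0] for name in feature_names)

for transformer, n_columns in counts.items():
    print(f"{transformer}: {n_columns}")
```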


In [71]:
# View feature names
print("Feature names after preprocessing:")
for i, name in enumerate(feature_names):
    print(f"  {i+1:2d}. {name}")

Feature names after preprocessing:
   1. numerical__age
   2. numerical__study_hours
   3. numerical__class_attendance
   4. numerical__sleep_hours
   5. numerical__sleep_score
   6. numerical__optimal_sleep
   7. numerical__sleep_deficit
   8. numerical__effective_study_score
   9. numerical__class_time_quality
  10. numerical__sleep_hours_squared
  11. numerical__study_hours_squared
  12. numerical__class_attendance_squared
  13. numerical__sleep_score_squared
  14. numerical__effective_study_score_squared
  15. numerical__class_time_quality_squared
  16. numerical__sqrt_study_hours
  17. numerical__study_hours_times_attendance
  18. numerical__attendance_times_sleep
  19. numerical__study_hours_times_sleep_quality
  20. numerical__facility_x_sleepq
  21. numerical__high_att_high_study
  22. numerical__ideal_sleep_flag
  23. numerical__high_study_flag
  24. numerical__efficiency
  25. numerical__study_bin_num
  26. numerical__sleep_bin_num
  27. numerical__attendance_bin_num
  28. nu

In [72]:
# Convert to DataFrames for easier inspection
X_train_df = pd.DataFrame(X_train_processed, columns=feature_names)
X_test_df = pd.DataFrame(X_test_processed, columns=feature_names)

print("Processed training data sample:")
X_train_df.head()

Processed training data sample:


Unnamed: 0,numerical__age,numerical__study_hours,numerical__class_attendance,numerical__sleep_hours,numerical__sleep_score,numerical__optimal_sleep,numerical__sleep_deficit,numerical__effective_study_score,numerical__class_time_quality,numerical__sleep_hours_squared,...,nominal__course_ba,nominal__course_bba,nominal__course_bca,nominal__course_diploma,nominal__internet_access_yes,nominal__study_method_coaching,nominal__study_method_group study,nominal__study_method_mixed,nominal__study_method_online videos,nominal__study_method_self-study
0,0.200943,1.655875,1.538302,-1.245269,-0.624446,-0.553373,0.918294,1.243718,-0.634772,-1.176185,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1.126352,0.401573,1.308814,-1.359895,-1.352275,-0.553373,0.918294,-0.507737,0.657793,-1.253901,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
2,-0.241488,0.28716,1.182595,-0.729454,-1.195292,-0.553373,0.918294,-0.551242,1.913346,-0.786394,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
3,-0.68392,-0.848492,-1.290141,0.703367,0.345993,1.807099,-1.088976,-0.660808,0.072722,0.640413,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4,1.085807,1.545699,0.855575,1.448434,2.087074,-0.553373,-1.088976,2.392557,1.669922,1.582308,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [73]:
# Verify the transformations
print("=" * 60)
print("TRANSFORMATION VERIFICATION")
print("=" * 60)

print("\n📊 Numerical features (scaled):")
print(X_train_df[['numerical__age', 'numerical__study_hours', 
                   'numerical__class_attendance', 'numerical__sleep_hours']].describe().round(2))

print("\n📊 Ordinal features (encoded 0, 1, 2):")
ordinal_cols = [c for c in feature_names if 'ordinal' in c]
print(X_train_df[ordinal_cols].describe().round(2))

print("\n📊 One-hot encoded features (column sums):")
nominal_cols = [c for c in feature_names if 'nominal' in c]
print(f"   One-hot columns: {len(nominal_cols)}")
print(X_train_df[nominal_cols].sum().to_string())

TRANSFORMATION VERIFICATION

📊 Numerical features (scaled):
       numerical__age  numerical__study_hours  numerical__class_attendance  \
count       630000.00               630000.00                    630000.00   
mean             0.00                   -0.00                         0.00   
std              1.00                    1.00                         1.00   
min             -1.57                   -1.66                        -1.80   
25%             -0.68                   -0.86                        -0.86   
50%              0.20                   -0.00                         0.04   
75%              1.09                    0.87                         0.87   
max              1.53                    1.66                         1.57   

       numerical__sleep_hours  
count               630000.00  
mean                    -0.00  
std                      1.00  
min                     -1.70  
25%                     -0.84  
50%                      0.02  
75%       

In [74]:
# Create train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_processed, 
    y_train, 
    test_size=0.2, 
    random_state=42
)

print(f"Train set: {X_tr.shape[0]:,} samples")
print(f"Validation set: {X_val.shape[0]:,} samples")

Train set: 504,000 samples
Validation set: 126,000 samples
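
Note that in the cells above the preprocessor is fitted on all 630,000 training rows before the split, so the scaler's means and standard deviations include the validation rows. Whether this matters in practice depends on the use case. A minimal sketch of the alternative order (split first, then fit on the training part only) is shown below on synthetic data; the arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for a raw numerical feature matrix and target
X_raw = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y = rng.normal(size=1000)

# Split the raw data first
X_tr_raw, X_val_raw, y_tr, y_val = train_test_split(
    X_raw, y, test_size=0.2, random_state=42
)

# Fit scaling statistics on the training part only, then reuse them
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr_raw)
X_val = scaler.transform(X_val_raw)

print("train means:", X_tr.mean(axis=0).round(3))
print("val means:  ", X_val.mean(axis=0).round(3))
```

The training means come out at zero by construction, while the validation means are only close to zero, which is the behaviour expected from data the scaler has not seen.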


In [75]:
# Save processed data and pipeline
print("Saving processed data and pipeline...")

# Save as numpy arrays for fast loading
np.save(OUTPUT_DIR / 'X_train.npy', X_train_processed)
np.save(OUTPUT_DIR / 'y_train.npy', y_train.values)
np.save(OUTPUT_DIR / 'X_test.npy', X_test_processed)
np.save(OUTPUT_DIR / 'test_ids.npy', test_ids.values)

# Save train/val split
np.save(OUTPUT_DIR / 'X_tr.npy', X_tr)
np.save(OUTPUT_DIR / 'X_val.npy', X_val)
np.save(OUTPUT_DIR / 'y_tr.npy', y_tr.values)
np.save(OUTPUT_DIR / 'y_val.npy', y_val.values)

# Save feature names
pd.Series(feature_names).to_csv(OUTPUT_DIR / 'feature_names.csv', index=False)

# Save the preprocessor pipeline
joblib.dump(preprocessor, OUTPUT_DIR / 'preprocessor.joblib')

print(f"\n✅ All files saved to {OUTPUT_DIR}/")
print("Files saved:")
for f in sorted(OUTPUT_DIR.glob('*')):
    print(f"  - {f.name}")

Saving processed data and pipeline...

✅ All files saved to ../data/processed/
Files saved:
  - X_test.npy
  - X_tr.npy
  - X_train.npy
  - X_val.npy
  - feature_names.csv
  - preprocessor.joblib
  - test_ids.npy
  - y_tr.npy
  - y_train.npy
  - y_val.npy


## Summary

**Pipeline Created:**
- ✅ **Numerical features** → StandardScaler (mean=0, std=1)
- ✅ **Ordinal features** → OrdinalEncoder (poor=0, average=1, good=2, etc.)
- ✅ **Nominal features** → OneHotEncoder (one column dropped for binary features)

**Data Saved:**
- `X_train.npy`, `y_train.npy` - Full training data
- `X_test.npy`, `test_ids.npy` - Test data for submission
- `X_tr.npy`, `X_val.npy`, `y_tr.npy`, `y_val.npy` - Train/validation split
- `feature_names.csv` - Column names after encoding
- `preprocessor.joblib` - Saved pipeline for inference

**Usage in modeling notebook:**
```python
import numpy as np
import joblib

# Load processed data
X_train = np.load('../data/processed/X_train.npy')
y_train = np.load('../data/processed/y_train.npy')

# Load pipeline for new data
preprocessor = joblib.load('../data/processed/preprocessor.joblib')
```
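
One caveat: the saved `preprocessor.joblib` selects engineered columns such as `sleep_score`, so raw data must first pass through the same `create_*` functions before it can be transformed. A possible way to make the artifact self-contained is to place a `FunctionTransformer` ahead of the `ColumnTransformer`. The sketch below is an assumption about how that could look, using a reduced hypothetical feature function (`add_sleep_score`) and toy data rather than the notebook's actual pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_sleep_score(df):
    """Reduced, hypothetical stand-in for the notebook's create_* functions."""
    df = df.copy()
    rank = df["sleep_quality"].map({"poor": 0, "average": 1, "good": 2})
    df["sleep_score"] = df["sleep_hours"] * (rank + 1)
    return df

full_pipeline = Pipeline([
    # Step 1: derive engineered columns from the raw frame
    ("features", FunctionTransformer(add_sleep_score)),
    # Step 2: scale the selected columns and drop everything else
    ("preprocess", ColumnTransformer(
        [("numerical", StandardScaler(), ["sleep_hours", "sleep_score"])],
        remainder="drop",
    )),
])

# Toy raw data containing only original columns
raw = pd.DataFrame({
    "sleep_hours": [5.0, 7.0, 9.0],
    "sleep_quality": ["poor", "average", "good"],
})

processed = full_pipeline.fit_transform(raw)
print(processed.shape)
```

If such a pipeline is saved with joblib, the wrapped function is pickled by reference, so it has to be importable (for example from a shared module) wherever the file is loaded.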