# Demo 7: Turning Raw Data into Trainable Tensors

Every neural network starts with data, but raw data rarely looks like what a network needs. Before training can begin, you must transform messy CSV files into clean, standardized, batched tensors.

> **Overview**: Build a complete data pipeline from scratch: load a real dataset (CSV files hosted on Hugging Face) into a Pandas DataFrame, handle missing values, encode categorical features, split data properly, standardize features with scikit-learn, and convert everything into batched PyTorch tensors.
> 
> **Scenario**: Your telecom company has 4,000+ customer records with missing values, mixed scales, and categorical data. Before any model can predict churn, you need to transform this mess into clean, training-ready tensors.
> 
> **Goal**: Understand the complete data flow from raw CSV to training-ready mini-batches, and see how proper preprocessing enables reliable model training.
> 
> **Tools**: Python, PyTorch, Pandas, scikit-learn, Hugging Face `datasets`

## Step 1: Setup

Let's start by importing our libraries and setting up our environment.

In [1]:
# Import core libraries
import torch
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"Pandas version: {pd.__version__}")

Setup complete!
PyTorch version: 2.5.1+cu121
Pandas version: 2.3.3


## Step 2: Load the data

We'll use the [aai510-group1/telco-customer-churn](https://huggingface.co/datasets/aai510-group1/telco-customer-churn) dataset from Hugging Face, which contains 4,000+ real customer records from a telecommunications company.

The full dataset has 49 features, but we'll focus on 9 key predictors (plus the `Churn` target, for 10 columns in total) that are most relevant for understanding customer churn.

> **The preprocessing workflow ahead:**
> 
> Here's a compact view of the full pipeline we'll build:
> 
> 1. **Load data** → Convert Hugging Face dataset to Pandas DataFrame
> 2. **Create subset** → Select only the 9 features we are interested in, plus the target
> 3. **Handle missing values** → Fill Internet Type nulls with "None"
> 4. **Encode categories** → Convert Contract and Internet Type to numbers
> 5. **Split data** → Separate into train (70%) and validation (30%) sets
> 6. **Standardize features** → Scale to mean=0, std=1 (fit on train only)
> 7. **Convert to tensors** → Transform NumPy arrays to PyTorch float32 tensors
> 8. **Create TensorDataset** → Pair features with targets
> 9. **Build DataLoaders** → Enable batching, shuffling, and efficient iteration
> 
> Each step prepares the data for the next, transforming messy CSV into training-ready batches.

### 2.1: Load the data

We'll load the telco customer churn dataset from Hugging Face and convert it to a Pandas DataFrame.

In [2]:
# 1. Load the dataset from Hugging Face
dataset = load_dataset('aai510-group1/telco-customer-churn', split='train')

# Convert to Pandas DataFrame for easier manipulation
df = pd.DataFrame(dataset)

print(f"✓ Dataset loaded: {len(df)} customer records\n")

✓ Dataset loaded: 4225 customer records



### 2.2: Select key features

From the 49 available features, we'll focus on 9 predictors that capture customer churn patterns with a balanced mix of:
 - Demographics (age, dependents)
 - Service history (tenure, contract type)
 - Service details (internet type)
 - Financial metrics (charges)
 - Engagement indicators (satisfaction, referrals)

In [3]:
# 2. Select the 9 key features we'll use for prediction
feature_columns = [
    'Age',                          # Demographic
    'Dependents',                   # Demographic
    'Tenure in Months',             # Service history
    'Contract',                     # Service details (categorical)
    'Internet Type',                # Service details (categorical - has missing values!)
    'Monthly Charge',               # Financial
    'Total Charges',                # Financial
    'Satisfaction Score',           # Engagement
    'Number of Referrals',          # Engagement
]

target_column = 'Churn'  # Binary: 0 = Stayed, 1 = Churned

# Keep only the columns we need (.copy() makes later column assignments unambiguous)
df = df[feature_columns + [target_column]].copy()

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

Dataset shape: (4225, 10)

First few rows:
   Age  Dependents  Tenure in Months        Contract Internet Type  \
0   72           0                25        Two Year   Fiber Optic   
1   27           0                35  Month-to-Month   Fiber Optic   
2   59           0                46  Month-to-Month          None   
3   25           0                27        One Year           DSL   
4   31           0                58        One Year         Cable   

   Monthly Charge  Total Charges  Satisfaction Score  Number of Referrals  \
0           88.40        2191.15                   3                    1   
1           95.50        3418.20                   3                    0   
2           19.60         851.20                   5                    3   
3           45.85        1246.40                   4                    3   
4           60.30        3563.80                   2                    1   

   Churn  
0      0  
1      0  
2      0  
3      0  
4      1  


> **Feature selection in practice:** Feature selection is iterative. You'd experiment with more/fewer features, engineer new ones (like charges-per-month), transform skewed distributions, and refine based on model performance. 
> 
> This demo focuses on the preprocessing mechanics that apply regardless of which features you choose.

### 2.3: Handle missing values

Real-world data almost always has issues. Let's investigate what we're dealing with before making any transformations.

In [4]:
# Check data types and missing values
print("Data Types and Missing Values:")
print("=" * 60)
info_df = pd.DataFrame({
    'Column': df.columns,
    'Type': df.dtypes.values,
    'Missing': df.isnull().sum().values,
    'Missing %': (df.isnull().sum().values / len(df) * 100).round(2)
})
print(info_df.to_string(index=False))

Data Types and Missing Values:
             Column    Type  Missing  Missing %
                Age   int64        0       0.00
         Dependents   int64        0       0.00
   Tenure in Months   int64        0       0.00
           Contract  object        0       0.00
      Internet Type  object      886      20.97
     Monthly Charge float64        0       0.00
      Total Charges float64        0       0.00
 Satisfaction Score   int64        0       0.00
Number of Referrals   int64        0       0.00
              Churn   int64        0       0.00


> **Key observations from the data:**
> 
> - **Mixed data types**: We have both numeric features (Age, charges) and categorical features (Contract, Internet Type) that will need different preprocessing approaches
> - **Missing values**: Internet Type is the only feature with missing values.
> <details>
> <summary><b>What do the missing values in Internet Type mean?</b></summary>
> 
> The 886 missing values in Internet Type (about 21% of rows) aren't errors or data collection failures: they represent customers who don't have internet service at all. The missingness itself carries information: "no internet type" means "no internet subscription."
> 
> This is why we'll **fill with "None"** rather than drop these rows. Dropping would remove 21% of our data and lose valuable patterns (customers without internet might have different churn behavior). Creating a "None" category preserves this information and lets the model learn from it.
> 
> **General principle:** When missingness is meaningful (not random), preserve it as a feature. When it's random noise or data errors, consider dropping or imputing with statistics (mean, median, mode).
> 
> </details>
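
The general principle can be sketched on a toy frame. The values below are made up for illustration and are not taken from the telco dataset. The second half shows one common option for the random-missingness case (median imputation plus an indicator column), which this demo itself does not need.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the telco dataset)
toy = pd.DataFrame({
    "Internet Type": ["DSL", None, "Cable", None],
    "Monthly Charge": [45.0, np.nan, 60.0, 20.0],
})

# Meaningful missingness: keep it as its own category
toy["Internet Type"] = toy["Internet Type"].fillna("None")

# Random missingness: record where values were missing, then impute a statistic.
# In a real pipeline the median should be computed on the training split only.
toy["Monthly Charge Missing"] = toy["Monthly Charge"].isna().astype(int)
toy["Monthly Charge"] = toy["Monthly Charge"].fillna(toy["Monthly Charge"].median())

print(toy)
```

The indicator column is optional; it lets a model distinguish imputed values from observed ones.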

In [5]:
# Handle missing values in Internet Type
# Fill with "None" to indicate "no internet service"
df['Internet Type'] = df['Internet Type'].fillna('None')

# Verify no missing values remain
print("Missing values after handling:")
print(df.isnull().sum())
print("\n✓ All missing values handled!")

Missing values after handling:
Age                    0
Dependents             0
Tenure in Months       0
Contract               0
Internet Type          0
Monthly Charge         0
Total Charges          0
Satisfaction Score     0
Number of Referrals    0
Churn                  0
dtype: int64

✓ All missing values handled!


### 2.4: Encode categorical features

Neural networks need numeric inputs. Let's convert our categorical features (Contract, Internet Type) into numbers using one-hot encoding.

In [6]:
# Check unique values in categorical columns
print("Categorical features:")
print(f"Contract types: {df['Contract'].unique()}")
print(f"Internet types: {df['Internet Type'].unique()}")

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=['Contract', 'Internet Type'], drop_first=True)

print(f"\n✓ Encoding complete!")
print(f"Original shape: {df.shape}")
print(f"After encoding: {df_encoded.shape}")
print(f"\nNew columns created: {df_encoded.shape[1] - df.shape[1]} additional features")

Categorical features:
Contract types: ['Two Year' 'Month-to-Month' 'One Year']
Internet types: ['Fiber Optic' 'None' 'DSL' 'Cable']

✓ Encoding complete!
Original shape: (4225, 10)
After encoding: (4225, 13)

New columns created: 3 additional features


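> **Why one-hot encoding?** Each categorical value becomes a binary column (0 or 1). We use `drop_first=True` to avoid the "dummy variable trap", where the full set of dummy columns is perfectly collinear: they always sum to 1, so any one of them can be derived from the others.
> 
> For example, `drop_first=True` drops the alphabetically first Contract category, "Month-to-Month". If a customer is neither "One Year" nor "Two Year", they must be "Month-to-Month". So, we don't need all three columns.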
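
A minimal sketch of the effect of `drop_first=True`, using a three-row toy frame that contains the contract types from this demo (the `dtype=int` argument is only there to make the printed values easier to read):

```python
import pandas as pd

# Toy column with the three contract types seen in this demo
toy = pd.DataFrame({"Contract": ["Two Year", "Month-to-Month", "One Year"]})

full = pd.get_dummies(toy, columns=["Contract"], dtype=int)
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# With all three columns, every row sums to exactly 1 (redundant information)
row_sums = full.sum(axis=1).tolist()
print(row_sums)
```

Categories are ordered alphabetically, so the dropped "first" column is the one for "Month-to-Month".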

### 2.5: Split data (the Golden Rule in action)

Before any preprocessing that learns from data, we must split into training and validation sets. This prevents **information leakage**, i.e., held-out (validation or test) data accidentally influencing training.

> **Why no test set?** For this demo, train/val is sufficient to learn the preprocessing workflow. In a real project where you'd deploy a model, you'd use a three-way split (train/val/test) to get an honest final performance estimate. The preprocessing steps remain identical regardless of how many splits you create.

In [7]:
# Separate features and target
X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']

print("Before split:")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# Split into training (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.30,      # 30% for validation
    random_state=42,     # For reproducibility
    stratify=y           # Maintain the same churn ratio in both sets
)

print("\n" + "=" * 60)
print("After split:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"\nTraining target distribution: {y_train.value_counts().to_dict()}")
print(f"Validation target distribution: {y_val.value_counts().to_dict()}")
print("\n✓ Data split complete! (Golden rule: Split BEFORE preprocessing)")

Before split:
Features shape: (4225, 12)
Target shape: (4225,)
Target distribution: {0: 3104, 1: 1121}

After split:
Training set: 2957 samples (70.0%)
Validation set: 1268 samples (30.0%)

Training target distribution: {0: 2172, 1: 785}
Validation target distribution: {0: 932, 1: 336}

✓ Data split complete! (Golden rule: Split BEFORE preprocessing)


> **The golden rule in practice**: Notice we split BEFORE standardization. If we calculated mean and standard deviation on the entire dataset, information from validation examples would leak into our training process. By splitting first, we ensure the validation set truly represents unseen data.
>
> **Why stratify?** The `stratify=y` parameter ensures both sets have the same churn rate (~26.5%). Without this, we might accidentally get 30% churners in training but only 20% in validation, making comparisons unreliable.
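
To see what `stratify` does in isolation, here is a small sketch on synthetic labels with a 25% positive rate. The data and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 25% positive rate (made-up data)
y = np.array([1] * 250 + [0] * 750)
X = np.arange(len(y)).reshape(-1, 1)

# Same split twice: once stratified on y, once purely random
_, _, y_tr_s, y_val_s = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)
_, _, y_tr_r, y_val_r = train_test_split(X, y, test_size=0.30, random_state=0)

print(f"Stratified:   train={y_tr_s.mean():.3f}  val={y_val_s.mean():.3f}")
print(f"Unstratified: train={y_tr_r.mean():.3f}  val={y_val_r.mean():.3f}")
```

The stratified split reproduces the overall positive rate in both sets, while the unstratified split can drift from it by chance.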

### 2.6: Standardize features

Our features have wildly different scales (Age: 19-80, Total Charges: roughly 19-8,700). Standardization rescales each feature to mean 0 and standard deviation 1 via `z = (x - mean) / std`, helping gradient descent converge faster and more reliably.
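
As a sketch of what standardization computes, the snippet below applies `z = (x - mean) / std` by hand to a made-up single feature and checks the result against `StandardScaler`. The statistics come from the training values only and are reused for the validation values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and validation values for a single feature
train = np.array([[20.0], [40.0], [60.0], [80.0]])
val = np.array([[50.0], [100.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

manual_train = (train - mean) / std
manual_val = (val - mean) / std  # validation reuses the training mean and std

scaler = StandardScaler().fit(train)
matches_train = np.allclose(manual_train, scaler.transform(train))
matches_val = np.allclose(manual_val, scaler.transform(val))
print(matches_train, matches_val)
```

Validation values can land outside the range seen in training after scaling, and that is expected.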

In [8]:
# Check the current feature scales
print("Feature scales BEFORE standardization:")
print("=" * 60)
print(X_train.describe().loc[['mean', 'std', 'min', 'max']].round(2))

Feature scales BEFORE standardization:
        Age  Dependents  Tenure in Months  Monthly Charge  Total Charges  \
mean  46.57        0.24             32.52           64.52        2287.90   
std   16.69        0.43             24.60           30.09        2267.20   
min   19.00        0.00              1.00           18.40          18.80   
max   80.00        1.00             72.00          118.75        8672.45   

      Satisfaction Score  Number of Referrals  
mean                3.25                 2.00  
std                 1.21                 3.06  
min                 1.00                 0.00  
max                 5.00                11.00  


In [9]:
# Initialize the scaler
scaler = StandardScaler()

# CRITICAL: Fit the scaler ONLY on training data
# This calculates mean and std from training set only
scaler.fit(X_train)

# Transform both sets using the training statistics
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

print("\n✓ Standardization complete!")
print("\nFeature scales AFTER standardization (training set):")
print("=" * 60)
# Convert back to DataFrame for easier viewing
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
print(X_train_scaled_df.describe().loc[['mean', 'std', 'min', 'max']].round(2))


✓ Standardization complete!

Feature scales AFTER standardization (training set):
       Age  Dependents  Tenure in Months  Monthly Charge  Total Charges  \
mean  0.00        0.00              0.00            0.00           0.00   
std   1.00        1.00              1.00            1.00           1.00   
min  -1.65       -0.56             -1.28           -1.53          -1.00   
max   2.00        1.79              1.61            1.80           2.82   

      Satisfaction Score  Number of Referrals  Contract_One Year  \
mean               -0.00                 0.00              -0.00   
std                 1.00                 1.00               1.00   
min                -1.86                -0.66              -0.51   
max                 1.44                 2.94               1.95   

      Contract_Two Year  Internet Type_DSL  Internet Type_Fiber Optic  \
mean              -0.00               0.00                       0.00   
std                1.00               1.00            

> **When to use StandardScaler vs. alternatives:**
> - **StandardScaler (our choice)**: A solid default for tabular data with different units and scales. It is less distorted by a single extreme value than min/max scaling, but mean and std are still sensitive to outliers
> - **MinMaxScaler**: Better when you need bounded outputs [0,1] or when data has no extreme outliers
> - **RobustScaler**: Use when data has many outliers, since it uses median/IQR instead of mean/std
> 
> Find more information at [sklearn.preprocessing](https://scikit-learn.org/stable/api/sklearn.preprocessing.html).
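
The difference between the three scalers is easiest to see on data with one extreme value. The sketch below uses a made-up feature and measures how much spread the nine typical values keep after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up feature: nine typical values plus one extreme outlier
x = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 1000], dtype=float).reshape(-1, 1)

spans = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    typical = scaled[:9]  # everything except the outlier
    spans[type(scaler).__name__] = typical.max() - typical.min()
    print(f"{type(scaler).__name__:>14}: typical values span "
          f"{spans[type(scaler).__name__]:.3f}, outlier maps to {scaled[9]:.2f}")
```

On this toy feature, `RobustScaler` keeps the typical values well separated, while `StandardScaler` and especially `MinMaxScaler` squeeze them into a narrow band.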

### 2.7: Convert to PyTorch tensors

NumPy arrays are great for preprocessing, but PyTorch models need tensors. Let's convert our standardized data into the right format.

In [10]:
# Convert features to float32 tensors (standard for neural networks)
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val_scaled, dtype=torch.float32)

# Convert targets to float32 tensors (for binary classification with BCELoss)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32)

print("✓ Tensor conversion complete!")
print("\nTraining tensors:")
print(f"  Features: {X_train_tensor.shape} | dtype: {X_train_tensor.dtype}")
print(f"  Targets:  {y_train_tensor.shape} | dtype: {y_train_tensor.dtype}")
print("\nValidation tensors:")
print(f"  Features: {X_val_tensor.shape} | dtype: {X_val_tensor.dtype}")
print(f"  Targets:  {y_val_tensor.shape} | dtype: {y_val_tensor.dtype}")

# Quick sanity check: print first example
print("\nFirst training example:")
print(f"Features (first 5): {X_train_tensor[0, :5]}")
print(f"Target: {y_train_tensor[0].item()} ({'Churned' if y_train_tensor[0].item() == 1 else 'Stayed'})")

✓ Tensor conversion complete!

Training tensors:
  Features: torch.Size([2957, 12]) | dtype: torch.float32
  Targets:  torch.Size([2957]) | dtype: torch.float32

Validation tensors:
  Features: torch.Size([1268, 12]) | dtype: torch.float32
  Targets:  torch.Size([1268]) | dtype: torch.float32

First training example:
Features (first 5): tensor([-0.1538,  1.7940,  1.0360,  0.0757,  0.7422])
Target: 0.0 (Stayed)


> **Why float32?** Neural networks typically use 32-bit floating-point precision as a balance between accuracy and memory/speed (and modern GPUs are optimized for float32 operations!). BEWARE: Using the wrong dtype (like float64 or int) can cause cryptic errors during training.
>
> <details><summary><b> What about tensor shapes: are they as expected?</b></summary>
> <br>
> 
> **Yes, they're exactly right!** 
> - The features tensor has shape `(2957, 12)`, meaning 2,957 training examples with 12 features each (7 numeric columns plus 5 one-hot columns).
> - The target tensor has shape `(2957,)` with one label per example, i.e., a 1D tensor with 2,957 values.
> 
> *Key pattern:* For tabular data, the first dimension is the number of examples (the "batch dimension") and the second is the number of features per example. This `(num_examples, num_features)` layout is the standard convention across deep learning frameworks.
>
> </details>
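
A minimal sketch of the dtype pitfall, assuming a hypothetical `nn.Linear(12, 1)` layer standing in for a model's first layer and random numbers in place of the real features:

```python
import numpy as np
import torch
from torch import nn

# NumPy defaults to float64, and torch.from_numpy preserves that dtype
features_np = np.random.default_rng(0).normal(size=(4, 12))
features_f64 = torch.from_numpy(features_np)
features_f32 = torch.tensor(features_np, dtype=torch.float32)
print(features_f64.dtype, features_f32.dtype)

# nn.Linear stores float32 weights by default
layer = nn.Linear(12, 1)

try:
    layer(features_f64)  # float64 input against float32 weights
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)
print("float64 input:", mismatch_error)

output = layer(features_f32)
print("float32 input:", output.shape)
```

Passing `dtype=torch.float32` at conversion time, as the cell above does, avoids this class of error.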

### 2.8: Create TensorDataset (pairing features with targets)

[TensorDataset](https://docs.pytorch.org/docs/stable/data.html) is PyTorch's way of keeping features and targets synchronized. It wraps our tensors into a dataset that can be easily batched.

In [11]:
# Create TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

print("✓ TensorDatasets created!")
print(f"\nTraining dataset: {len(train_dataset)} examples")
print(f"Validation dataset: {len(val_dataset)} examples")

# Show what a single dataset item looks like
features, target = train_dataset[0]
print("\nSingle dataset item:")
print(f"  Features shape: {features.shape}")
print(f"  Target shape: {target.shape}")
print(f"  Target value: {target.item()}")

✓ TensorDatasets created!

Training dataset: 2957 examples
Validation dataset: 1268 examples

Single dataset item:
  Features shape: torch.Size([12])
  Target shape: torch.Size([])
  Target value: 0.0


> **What TensorDataset does**: It creates a simple wrapper that returns `(features, target)` pairs. When you access `train_dataset[0]`, you get the features and target for the first example. This synchronization is crucial: when we shuffle or batch the data, the features and targets stay properly aligned.
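
The alignment guarantee can be checked with a small sketch. The data is made up so that each target can be recomputed from its features, which makes any misalignment detectable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data where each target is recoverable from its features:
# target = first feature * 10
features = torch.arange(8, dtype=torch.float32).reshape(8, 1)
targets = features[:, 0] * 10

loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=True)

pairs_aligned = True
for batch_x, batch_y in loader:
    # Row order is random, but each row should still carry its own target
    pairs_aligned = pairs_aligned and torch.equal(batch_x[:, 0] * 10, batch_y)
print("Pairs still aligned after shuffling:", pairs_aligned)
```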

### 2.9: Build DataLoaders (the bridge to training)

[DataLoader](https://docs.pytorch.org/docs/stable/data.html) handles the heavy lifting: batching data, shuffling for training, and efficiently feeding examples to your model during training loops.

In [12]:
# Create DataLoaders
batch_size = 32  # Process 32 examples at a time

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,          # Shuffle training data each epoch
    drop_last=False        # Keep the last incomplete batch
)

val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,         # Don't shuffle validation data
    drop_last=False
)

print("✓ DataLoaders created!")
print(f"\nTraining loader:")
print(f"  Total examples: {len(train_dataset)}")
print(f"  Batch size: {batch_size}")
print(f"  Number of batches: {len(train_loader)}")
print(f"  Shuffle: True")

print(f"\nValidation loader:")
print(f"  Total examples: {len(val_dataset)}")
print(f"  Batch size: {batch_size}")
print(f"  Number of batches: {len(val_loader)}")
print(f"  Shuffle: False")

✓ DataLoaders created!

Training loader:
  Total examples: 2957
  Batch size: 32
  Number of batches: 93
  Shuffle: True

Validation loader:
  Total examples: 1268
  Batch size: 32
  Number of batches: 40
  Shuffle: False


> **Why shuffle training but not validation?** Shuffling training data prevents the model from picking up spurious patterns from the order of examples. If all churned customers appeared at the end, entire batches would contain a single class, and gradient updates would swing toward whichever class came last rather than reflecting true churn indicators.
>
> We DON'T shuffle validation data because we want consistent, reproducible evaluation. The order doesn't matter for computing metrics, and keeping it fixed makes debugging easier.
>
><details><summary><b>How many batches will we get during training?</b></summary>
>
> *Calculation:* With 2,957 training examples and `batch_size=32`, we get: 2,957 ÷ 32 = 92.4 → **93 batches**
> 
> Since we set `drop_last=False`, the final batch contains the remaining 13 examples (2,957 - 92×32 = 13). If we had set `drop_last=True`, we'd have 92 full batches and those last 13 examples would be discarded.
> 
> *Why this matters:* The number of batches equals the number of gradient updates per epoch. For a fixed dataset, a smaller batch size means more batches: more frequent updates, but each one is based on a noisier gradient estimate.
> 
> </details>
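
The batch arithmetic can be confirmed with a short sketch. It uses a placeholder dataset of zeros with the same number of rows as the training set, since only the row count matters here.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

n_examples, batch_size = 2957, 32  # training-set size and batch size from this demo

# Placeholder tensor: only the number of rows matters here
placeholder = TensorDataset(torch.zeros(n_examples, 1))

keep_last = DataLoader(placeholder, batch_size=batch_size, drop_last=False)
drop_last = DataLoader(placeholder, batch_size=batch_size, drop_last=True)

print(len(keep_last), math.ceil(n_examples / batch_size))  # 93 93
print(len(drop_last), n_examples // batch_size)            # 92 92

# Size of the final, incomplete batch when drop_last=False
last_batch_size = [batch[0].shape[0] for batch in keep_last][-1]
print(last_batch_size)                                      # 13
```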

## Step 3: Inspect a batch

Let's look at what actually gets fed into your neural network during training. This is the moment where CSV data becomes training-ready tensors.

In [13]:
# Get one batch from the training loader
batch_features, batch_targets = next(iter(train_loader))

print("One training batch:")
print("=" * 60)
print(f"Batch features shape: {batch_features.shape}")
print(f"  → {batch_features.shape[0]} examples in this batch")
print(f"  → {batch_features.shape[1]} features per example")
print(f"\nBatch targets shape: {batch_targets.shape}")
print(f"  → {batch_targets.shape[0]} labels (one per example)")

print("\nFirst 3 examples in batch:")
print(f"Features (first 5 columns):")
print(batch_features[:3, :5])
print(f"\nTargets:")
print(batch_targets[:3])

# Show target distribution in this batch
churned_in_batch = (batch_targets == 1).sum().item()
stayed_in_batch = (batch_targets == 0).sum().item()
print(f"\nBatch composition:")
print(f"  Churned: {churned_in_batch} ({churned_in_batch/len(batch_targets)*100:.1f}%)")
print(f"  Stayed: {stayed_in_batch} ({stayed_in_batch/len(batch_targets)*100:.1f}%)")

One training batch:
Batch features shape: torch.Size([32, 12])
  → 32 examples in this batch
  → 12 features per example

Batch targets shape: torch.Size([32])
  → 32 labels (one per example)

First 3 examples in batch:
Features (first 5 columns):
tensor([[-1.5325, -0.5574,  1.6052,  1.1593,  2.2048],
        [ 0.2058, -0.5574,  1.2393, -1.4633, -0.4234],
        [ 0.5055, -0.5574,  0.4668, -0.3498,  0.0672]])

Targets:
tensor([0., 0., 0.])

Batch composition:
  Churned: 8 (25.0%)
  Stayed: 24 (75.0%)


> **Iterating through DataLoaders**: We used `next(iter(train_loader))` to manually grab one batch for inspection. During actual training, you'd simply loop: `for batch_features, batch_targets in train_loader:` and the DataLoader automatically handles batching, shuffling, and resetting after each epoch.
>
><details><summary><b>Is this batch composition (25% churned, 75% stayed) guaranteed by shuffling?</b></summary>
> 
> No. This batch happens to be close to our overall dataset distribution (~26.5% churned, ~73.5% stayed), but shuffling only randomizes the order; it doesn't stratify batches. Most batches will be roughly representative, though individual batches vary by chance.
> 
> *Without shuffling*, we might see all churned customers clustered together in consecutive batches, causing the model to learn unstable, order-based patterns rather than true churn indicators.
> </details>
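
For reference, here is a minimal sketch of the training loop that a `DataLoader` feeds. It is not part of this demo's pipeline: the data is synthetic, the one-layer model is hypothetical, and it uses `BCEWithLogitsLoss` (which applies the sigmoid internally) rather than `BCELoss`.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for the demo's tensors: 12 features, binary float targets
X = torch.randn(256, 12)
y = (X[:, 0] > 0).float()
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Hypothetical one-layer model; BCEWithLogitsLoss expects raw logits
model = nn.Linear(12, 1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for batch_features, batch_targets in train_loader:  # reshuffled every epoch
        optimizer.zero_grad()
        logits = model(batch_features).squeeze(1)        # (batch, 1) -> (batch,)
        loss = loss_fn(logits, batch_targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * len(batch_targets)
    epoch_losses.append(running_loss / len(train_loader.dataset))
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

The float32 targets with shape `(batch,)` match the squeezed logits, which is why the targets were converted to float32 earlier.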


## Step 4: Visualize the complete pipeline

Let's trace one example all the way through our pipeline to see every transformation.

In [14]:
# Pick one customer from the original data
example_idx = 42
original_customer_full = pd.DataFrame(dataset).iloc[example_idx]
original_customer = df.iloc[example_idx]

print("TRACKING ONE CUSTOMER THROUGH THE PIPELINE")
print("=" * 60)

# Stage 1: Original data (Section 2.1: Load the data)
print("\n[Stage 1] Original CSV data:")
print(f"  Features: {original_customer_full.drop('Churn').to_dict()}")
print(f"  Target: Churn = {original_customer_full['Churn']}")

# Stage 2: After feature selection (Section 2.2: Select key features)
print("\n[Stage 2] After selecting 9 key features:")
print(f"  Features: {original_customer.drop('Churn').to_dict()}")
print(f"  Kept: {len(original_customer)} columns (9 features + 1 target)")
print(f"  Dropped: {len(original_customer_full) - len(original_customer)} other columns")

# Stage 3: After handling missing values (Section 2.3: Handle missing values)
print("\n[Stage 3] After handling missing values:")
print(f"  Internet Type: '{original_customer['Internet Type']}' (filled if it was null)")

# Stage 4: After encoding (Section 2.4: Encode categorical features)
encoded_customer = df_encoded.iloc[example_idx]
print("\n[Stage 4] After one-hot encoding:")
print(f"  Shape changed: {len(original_customer)} → {len(encoded_customer)} columns (including the target)")
print(f"  Net columns added by encoding: {len(encoded_customer) - len(original_customer)}")

# Stage 5: After splitting (Section 2.5: Split data)
# Check if it's in the training or validation set
if example_idx in X_train.index:
    split_name = "Training"
    dataset_name = "train_dataset"
    position_in_split = X_train.index.get_loc(example_idx)
    scaled_features = X_train_scaled[position_in_split]
else:
    split_name = "Validation"
    dataset_name = "val_dataset"
    position_in_split = X_val.index.get_loc(example_idx)
    scaled_features = X_val_scaled[position_in_split]

print(f"\n[Stage 5] After splitting: Ended up in {split_name} set")

# Stage 6: After standardization (Section 2.6: Standardize features)
print(f"\n[Stage 6] After standardization (first 5 features):")
print(f"  Before: {encoded_customer.drop('Churn').values[:5]}")
print(f"  After:  {scaled_features[:5]}")
print(f"  → Values now expressed in training-set standard deviations from the training mean")

# Stage 7: As a tensor (Section 2.7: Convert to PyTorch tensors)
example_tensor = torch.tensor(scaled_features, dtype=torch.float32)
print(f"\n[Stage 7] As PyTorch tensor:")
print(f"  dtype: {example_tensor.dtype}")
print(f"  shape: {tuple(example_tensor.shape)} - ready for neural network input layer")

# Stage 8: In TensorDataset (Section 2.8: Create TensorDataset)
print(f"\n[Stage 8] Paired in TensorDataset:")
print(f"  Features and target kept synchronized")
print(f"  Can access as: {dataset_name}[{position_in_split}]")

# Stage 9: In a batch (Section 2.9: Build DataLoaders)
print(f"\n[Stage 9] In a DataLoader batch:")
print(f"  Will be stacked with up to {batch_size - 1} other examples")
print(f"  Batch shape: ({batch_size}, {example_tensor.shape[0]})")
print(f"  → First dimension is batch size, second is features")

print("\n" + "=" * 60)
print("✓ Complete transformation: CSV → Select → Clean → Encode → Split → Standardize → Tensor → Dataset → Batch")

TRACKING ONE CUSTOMER THROUGH THE PIPELINE

[Stage 1] Original CSV data:
  Features: {'Age': 27, 'Avg Monthly GB Download': 0, 'Avg Monthly Long Distance Charges': 43.5, 'Churn Category': None, 'Churn Reason': None, 'Churn Score': 34, 'City': 'Thousand Palms', 'CLTV': 3854, 'Contract': 'Month-to-Month', 'Country': 'United States', 'Customer ID': '3061-BCKYI', 'Customer Status': 'Stayed', 'Dependents': 0, 'Device Protection Plan': 0, 'Gender': 'Male', 'Internet Service': 0, 'Internet Type': None, 'Lat Long': '33.849263, -116.382778', 'Latitude': 33.849263, 'Longitude': -116.382778, 'Married': 0, 'Monthly Charge': 19.9, 'Multiple Lines': 0, 'Number of Dependents': 0, 'Number of Referrals': 0, 'Offer': 'Offer D', 'Online Backup': 0, 'Online Security': 0, 'Paperless Billing': 0, 'Partner': 0, 'Payment Method': 'Credit Card', 'Phone Service': 1, 'Population': 6242, 'Premium Tech Support': 0, 'Quarter': 'Q3', 'Referred a Friend': 0, 'Satisfaction Score': 3, 'Senior Citizen': 0, 'State': 'Cal

> **Why each step matters**: Without encoding, the network couldn't process "Fiber Optic" as input. Without splitting first, our validation metrics would be unreliable. Without standardization, gradient descent would converge slowly and erratically. Without batching, training would be either very slow (one example at a time) or memory-hungry (the whole dataset at once).
> 
> Each transformation solves a specific obstacle between "data exists" and "model can learn".

## Conclusion

Congratulations! You've built a complete data pipeline from raw CSV to training-ready batches. This workflow (load, clean, encode, split, standardize, convert, batch) underlies most deep learning projects on tabular data.

**What you've learned:**
- [x] **The complete data pipeline** - Transform raw CSV files into training-ready batches through a systematic workflow: load → clean → split → scale → convert → batch
- [x] **The golden rule in action** - Always split before any preprocessing that learns from the data, to prevent information leakage; fit transformations on training data only, then apply them to validation
- [x] **Why preprocessing matters** - Handling missing values, encoding categories, and standardizing features aren't optional steps: they determine whether gradient descent can learn efficiently
- [x] **The bridge to PyTorch** - TensorDataset pairs features with targets, DataLoader handles batching and shuffling, and together they transform static data into flowing training examples

> **Key insight**: The data pipeline isn't just a preprocessing step; it's the foundation that determines whether your model can learn at all. Get the dtypes wrong, leak information from validation or test data into training, or skip standardization, and even the best model architecture will struggle. Build a solid pipeline like this one, and you've solved half the problem before training even begins.

##### Next steps:

- [ ] **What if your data has outliers?** Some customers might have extreme values (e.g., 10 years tenure, $10,000 monthly charges). How would you detect and handle these outliers before they distort your scaling?
- [ ] **How do you handle temporal data?** If your dataset has a time component (signup date, usage history), how would you create time-based features or ensure your train/val split respects chronological order?
- [ ] **What happens with new categorical values?** If a new customer has a contract type you've never seen, how does your one-hot encoding handle it? Should you add an "unknown" category?
- [ ] **What about class imbalance strategies?** With 73% stayed vs 27% churned, should you use techniques like oversampling, undersampling, or class weights during training?
- [ ] **Can you build a reusable preprocessing function?** Wrap this entire pipeline into a function that takes raw data and returns DataLoaders, making it easy to preprocess any similar dataset (especially during model inference!).