## Next Steps: Data Preparation for Modeling

### Why Prepare Data?
Before building predictive models, we need to:
1. **Handle Categorical Variables**: Convert text categories (smoker/region/sex) to numerical values
2. **Feature Engineering**: Create meaningful derived features (e.g., BMI categories)
3. **Feature Scaling**: Standardize numerical features (zero mean, unit variance) so they share a comparable scale
4. **Train-Test Split**: Separate data for model training and evaluation

### Key Transformations Needed:
```python
# Example preprocessing steps (assumes medical_df is already loaded):
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# 1. Convert smoker to binary (1/0)
medical_df['smoker'] = medical_df['smoker'].map({'yes':1, 'no':0})

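# 2. Create BMI categories feature
#    right=False makes bins left-inclusive: a BMI of exactly 25 is 'overweight', 30 is 'obese'
medical_df['bmi_category'] = pd.cut(medical_df['bmi'],
                                   bins=[0, 18.5, 25, 30, 100],
                                   right=False,
                                   labels=['underweight', 'normal', 'overweight', 'obese'])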

# 3. One-hot encode categorical features
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(medical_df[['sex', 'region', 'bmi_category']])

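# 4. Scale numerical features
#    Note: fitting on the full dataset lets test-set statistics leak into training;
#    for a real evaluation, fit the scaler on the training split only
scaler = StandardScaler()
scaled_features = scaler.fit_transform(medical_df[['age', 'bmi', 'children']])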

# 5. Combine all features (including the binary smoker column from step 1)
X = np.concatenate([scaled_features,
                    medical_df[['smoker']].to_numpy(),
                    encoded_features.toarray()], axis=1)
y = medical_df['charges']

# 6. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
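
One possible variation on the steps above is to split first and fit the preprocessing on the training rows only, so that no test-set statistics reach the scaler or encoder. The sketch below uses scikit-learn's `ColumnTransformer` for this. Since `medical_df` is not loaded here, it builds a small synthetic stand-in called `toy_df` (a hypothetical name, with made-up values) that only mimics the column names used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for medical_df (made-up values, for illustration only)
rng = np.random.default_rng(0)
n = 50
toy_df = pd.DataFrame({
    'age': rng.integers(18, 65, size=n),
    'bmi': rng.uniform(16, 40, size=n),
    'children': rng.integers(0, 4, size=n),
    'smoker': rng.integers(0, 2, size=n),  # assumed already mapped to 1/0
    'sex': rng.choice(['female', 'male'], size=n),
    'region': rng.choice(['northeast', 'northwest', 'southeast', 'southwest'], size=n),
    'charges': rng.uniform(1000, 50000, size=n),
})

X = toy_df.drop(columns='charges')
y = toy_df['charges']

# Split first so the test rows never influence the fitted statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

preprocessor = ColumnTransformer(
    [
        ('num', StandardScaler(), ['age', 'bmi', 'children']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'region']),
        ('bin', 'passthrough', ['smoker']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn means, variances, and category lists from the training split only
X_train_prepared = preprocessor.fit_transform(X_train)
# Reuse those fitted statistics on the held-out rows
X_test_prepared = preprocessor.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Setting `handle_unknown='ignore'` is a design choice for this sketch: a category that appears only in the test split is encoded as all zeros instead of raising an error.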

### Why Each Step Matters:
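1. **Categorical Encoding**: Most scikit-learn estimators accept only numerical input, so text categories must be encoded first
2. **BMI Categories**: May capture non-linear relationships between BMI and charges that a single linear term would miss
3. **Feature Scaling**: Keeps features with large ranges (e.g., age) from dominating scale-sensitive models such as regularized regression or k-nearest neighbors
4. **Train-Test Split**: Holds out data the model never sees during training, giving a fairer estimate of performance on new data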
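
To make two of these points concrete, here is a minimal sketch on made-up numbers (not taken from the dataset). It assumes the conventional left-inclusive BMI boundaries (`right=False` in `pd.cut`) and shows how standardization puts columns with very different raw ranges on the same scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up BMI values chosen to sit on and around the category boundaries
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 30.0])
bmi_category = pd.cut(bmi,
                      bins=[0, 18.5, 25, 30, 100],
                      right=False,  # intervals are [low, high)
                      labels=['underweight', 'normal', 'overweight', 'obese'])
print(bmi_category.tolist())
# ['underweight', 'normal', 'normal', 'overweight', 'obese']

# Two made-up columns on very different raw scales: age and children
raw = np.array([[20.0, 0.0],
                [40.0, 2.0],
                [60.0, 4.0]])
scaled = StandardScaler().fit_transform(raw)

# After standardization both columns have mean 0 and unit variance,
# and in this toy case they become identical
print(scaled)
```

In the toy array, age spans 40 units and children spans 4, yet after scaling both columns carry the same values, so neither outweighs the other purely because of its units.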