# Assignment 2: Engineering Predictive Features

**Student Name:** [Your Name Here]

**Date:** [Date]

---

## Assignment Overview

In this assignment, you'll practice feature engineering by creating new predictive features from the Ames Housing dataset. You'll build a baseline model with raw features, engineer at least 5 new features based on real estate intuition, and measure whether, and by how much, those features improve model performance.

---

## Step 1: Import Libraries and Load Data

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [None]:
# Load the Ames Housing dataset
# TODO: Load train.csv from the data folder
df = None  # Replace with pd.read_csv()

# Display basic information
# TODO: Display the first few rows and basic info about the dataset


print("\n" + "="*80)
print("CHECKPOINT: Verify dataset loaded correctly")
print(f"Dataset shape: {df.shape if df is not None else 'Not loaded'}")
print("="*80)
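
The loading step above might look like the following minimal sketch. It assumes the file sits at `data/train.csv` relative to the notebook; that path and the helper name `load_ames` are illustrative assumptions, not part of the assignment. A tiny in-memory CSV with made-up rows stands in for the real file so the sketch runs on its own.

```python
import io

import pandas as pd


def load_ames(source="data/train.csv"):
    """Read the dataset from a path or file-like object (hypothetical helper)."""
    return pd.read_csv(source)


# Stand-in for the real file: two made-up rows, NOT actual Ames data.
sample_csv = io.StringIO(
    "Id,GrLivArea,OverallQual,SalePrice\n"
    "1,1500,7,200000\n"
    "2,900,5,120000\n"
)
sample_df = load_ames(sample_csv)

print(sample_df.head())  # first few rows
sample_df.info()         # column dtypes and non-null counts
print(sample_df.shape)   # (rows, columns)
```

In the notebook itself you would call `load_ames()` with the real path and assign the result to `df`.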

---
## Step 2: Build Baseline Model with Raw Features

### Select Raw Features for Baseline

Select 10-15 raw features to use in your baseline model. Here's a suggested starting set (you can adjust):

**Suggested features:**
- `GrLivArea` - Above grade living area square feet
- `OverallQual` - Overall material and finish quality
- `YearBuilt` - Original construction year
- `TotalBsmtSF` - Total basement square feet
- `FullBath` - Full bathrooms above grade
- `BedroomAbvGr` - Bedrooms above grade
- `GarageArea` - Size of garage in square feet
- `LotArea` - Lot size in square feet
- `Neighborhood` - Physical location (categorical)
- Add 1-6 more features you think are important (to reach the 10-15 total)

In [None]:
# Select features for baseline model
# TODO: Create a list of feature names you want to use
baseline_features = [
    'GrLivArea',
    'OverallQual',
    'YearBuilt',
    # Add more features here
]

# TODO: Create X (features) and y (target) for baseline
# Make sure to handle missing values and encode categorical variables
X_baseline = None  # Replace with your feature matrix
y = None  # Replace with df['SalePrice']

print(f"Baseline features selected: {len(baseline_features)}")
print(f"Target variable shape: {y.shape if y is not None else 'Not defined'}")

### Preprocess Baseline Features

In [None]:
# Handle missing values
# TODO: Fill missing values appropriately
# Numeric: Use median or 0
# Categorical: Use 'None' or most frequent


# Encode categorical variables
# TODO: Use pd.get_dummies() for categorical features


print("\n" + "="*80)
print("CHECKPOINT: After preprocessing")
print(f"X_baseline shape: {X_baseline.shape if X_baseline is not None else 'Not defined'}")
print(f"Missing values: {X_baseline.isnull().sum().sum() if X_baseline is not None else 'N/A'}")
print("="*80)
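
One way to approach the preprocessing TODOs is sketched below on a toy frame. The helper name `preprocess` and the toy values are assumptions for illustration. It fills numeric gaps with the column median, treats a missing category as its own `'None'` level, and one-hot encodes with `pd.get_dummies()`.

```python
import numpy as np
import pandas as pd


def preprocess(frame):
    """Fill missing values and one-hot encode categoricals (hypothetical helper)."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    categorical_cols = out.columns.difference(numeric_cols)
    # Numeric: fill gaps with each column's median
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # Categorical: treat "missing" as its own level
    out[categorical_cols] = out[categorical_cols].fillna("None")
    # One-hot encode the categorical columns
    return pd.get_dummies(out, columns=list(categorical_cols))


# Toy rows, made up for illustration
toy = pd.DataFrame({
    "GrLivArea": [1500.0, np.nan, 900.0],
    "Neighborhood": ["NAmes", None, "OldTown"],
})
X_toy = preprocess(toy)
print(X_toy)
print("Missing values:", X_toy.isnull().sum().sum())
```

Note that this sketch computes medians over all rows. A stricter workflow would compute them on the training split only, to avoid leaking test-set information.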

### Train Baseline Model

In [None]:
# Split data into train and test sets
# TODO: Use train_test_split with test_size=0.2, random_state=42
X_train, X_test, y_train, y_test = None, None, None, None  # Replace with train_test_split()

# Train baseline Random Forest model
# TODO: Create and train RandomForestRegressor(n_estimators=100, random_state=42)
baseline_model = None  # Replace with trained model

# Make predictions
# TODO: Generate predictions on test set
baseline_predictions = None  # Replace with predictions

# Calculate metrics
# TODO: Calculate R² and RMSE
baseline_r2 = None  # Replace with r2_score()
baseline_rmse = None  # Replace with np.sqrt(mean_squared_error())

print("\n" + "="*80)
print("BASELINE MODEL RESULTS")
print("="*80)
print(f"R² Score: {baseline_r2 if baseline_r2 is not None else 'Not calculated'}")
print(f"RMSE: ${baseline_rmse:,.2f}" if baseline_rmse is not None else "RMSE: Not calculated")
print("="*80)
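
The split, fit, predict, and score steps above can be sketched as a single helper. The name `fit_and_score` and the synthetic data are assumptions for illustration only; the numbers it prints say nothing about the Ames dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, y):
    """Train a random forest on an 80/20 split and return (model, R², RMSE)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    r2 = r2_score(y_test, preds)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    return model, r2, rmse


# Synthetic stand-in data (not Ames): price depends on area and quality
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "area": rng.uniform(500, 3000, 200),
    "quality": rng.integers(1, 11, 200),
})
y_demo = 100 * X_demo["area"] + 10_000 * X_demo["quality"] + rng.normal(0, 5_000, 200)

demo_model, demo_r2, demo_rmse = fit_and_score(X_demo, y_demo)
print(f"R²: {demo_r2:.3f}  RMSE: {demo_rmse:,.2f}")
```

Wrapping the steps in one function makes it easy to evaluate a second feature matrix later with exactly the same settings.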

### Visualize Baseline Feature Importances

In [None]:
# Extract and visualize feature importances
# TODO: Get feature importances from baseline_model
# TODO: Create a horizontal bar plot of top 10 features


print("\n" + "="*80)
print("CHECKPOINT: Review which raw features are most important")
print("="*80)
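
A minimal sketch of the importance plot follows, using a small synthetic model so it runs on its own. The helper `top_importances` and the column names `a`, `b`, `c` are hypothetical; in the notebook you would pass `baseline_model` and the columns of `X_baseline`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def top_importances(model, feature_names, k=10):
    """Return the k largest feature importances as a labeled Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.nlargest(k)


# Synthetic stand-in: only column "a" actually drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = 5 * X_demo["a"] + rng.normal(scale=0.1, size=100)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

top = top_importances(model, X_demo.columns, k=10)

fig, ax = plt.subplots()
top.sort_values().plot.barh(ax=ax)  # ascending sort puts the largest bar on top
ax.set_xlabel("Importance")
ax.set_title("Top feature importances")
```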

---
## Step 3: Engineer New Features

### Feature 1: [Feature Name] - [Category]

**Business Justification:**
[Write 2-3 sentences explaining what this feature measures, why it should predict house prices, and what real estate intuition supports it]

In [None]:
# TODO: Create your first engineered feature
# Example: df['total_bathrooms'] = df['FullBath'] + 0.5 * df['HalfBath']
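
To illustrate the pattern, here is a sketch of three candidate features on a toy frame. `total_bathrooms` follows the example above. `house_age` and `total_sf` are illustrative ideas, not required answers, and they assume the columns `YrSold`, `GrLivArea`, and `TotalBsmtSF` are present in your copy of the data. The rows are made up.

```python
import pandas as pd

# Toy rows, made up for illustration (not actual Ames records)
toy = pd.DataFrame({
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "YearBuilt": [2000, 1950],
    "YrSold": [2010, 2008],
    "GrLivArea": [1500, 900],
    "TotalBsmtSF": [800, 0],
})

# Count a half bath as half of a full bath
toy["total_bathrooms"] = toy["FullBath"] + 0.5 * toy["HalfBath"]
# Age of the house at the time of sale
toy["house_age"] = toy["YrSold"] - toy["YearBuilt"]
# Combined above-grade and basement area
toy["total_sf"] = toy["GrLivArea"] + toy["TotalBsmtSF"]

print(toy[["total_bathrooms", "house_age", "total_sf"]])
```

Each feature still needs its own business justification. The code alone does not explain why the quantity should track sale price.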


### Feature 2: [Feature Name] - [Category]

**Business Justification:**
[Write 2-3 sentences explaining this feature]

In [None]:
# TODO: Create your second engineered feature


### Feature 3: [Feature Name] - [Category]

**Business Justification:**
[Write 2-3 sentences explaining this feature]

In [None]:
# TODO: Create your third engineered feature


### Feature 4: [Feature Name] - [Category]

**Business Justification:**
[Write 2-3 sentences explaining this feature]

In [None]:
# TODO: Create your fourth engineered feature


### Feature 5: [Feature Name] - [Category]

**Business Justification:**
[Write 2-3 sentences explaining this feature]

In [None]:
# TODO: Create your fifth engineered feature


### Add More Engineered Features (Optional)

You can create additional features beyond the required 5 if you think they'll improve performance.

In [None]:
# Optional: Create additional engineered features


---
## Step 4: Train Model with Engineered Features

In [None]:
# Create feature list combining baseline + engineered features
# TODO: List all your engineered feature names
engineered_features = [
    # Add your engineered feature names here
]

# Combine baseline and engineered features
all_features = baseline_features + engineered_features

# TODO: Create X_engineered with all features
# Remember to handle missing values and encode categoricals
X_engineered = None  # Replace with your feature matrix

print(f"Total features in engineered model: {len(all_features)}")
print(f"New engineered features: {len(engineered_features)}")

In [None]:
# Split data (use the same test_size and random_state as the baseline so both models are scored on the same rows)
# TODO: Split X_engineered and y
X_train_eng, X_test_eng, y_train_eng, y_test_eng = None, None, None, None

# Train model with engineered features
# TODO: Train RandomForestRegressor(n_estimators=100, random_state=42)
engineered_model = None  # Replace with trained model

# Make predictions
# TODO: Generate predictions on test set
engineered_predictions = None  # Replace with predictions

# Calculate metrics
# TODO: Calculate R² and RMSE
engineered_r2 = None  # Replace with r2_score()
engineered_rmse = None  # Replace with np.sqrt(mean_squared_error())

print("\n" + "="*80)
print("ENGINEERED MODEL RESULTS")
print("="*80)
print(f"R² Score: {engineered_r2 if engineered_r2 is not None else 'Not calculated'}")
print(f"RMSE: ${engineered_rmse:,.2f}" if engineered_rmse is not None else "RMSE: Not calculated")
print("="*80)
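
The reason for reusing `random_state` can be checked directly. The sketch below, on toy data with hypothetical values, splits a baseline matrix and an engineered matrix with identical settings and confirms that both test sets contain the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: both matrices share the same rows in the same order
y = pd.Series(range(10), name="SalePrice")
X_base = pd.DataFrame({"GrLivArea": range(10)})
X_eng = X_base.assign(total_sf=[v * 2 for v in range(10)])

_, X_test_base, _, y_test_base = train_test_split(
    X_base, y, test_size=0.2, random_state=42
)
_, X_test_eng, _, y_test_eng = train_test_split(
    X_eng, y, test_size=0.2, random_state=42
)

# Identical settings and row counts give the same held-out rows
same_rows = X_test_base.index.equals(X_test_eng.index)
print("Same test rows:", same_rows)
```

This only holds when both matrices have the same number of rows in the same order, so avoid dropping rows while engineering features.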

---
## Step 5: Compare Models and Identify Most Valuable Features

### Create Comparison Table

In [None]:
# Create comparison DataFrame
# TODO: Create a table comparing baseline vs engineered model
comparison = None  # Replace with pd.DataFrame()

print("\n" + "="*80)
print("MODEL COMPARISON")
print("="*80)
# TODO: Display comparison table

print("="*80)

# Calculate improvement
if all(v is not None for v in (baseline_r2, engineered_r2, baseline_rmse, engineered_rmse)):
    r2_improvement = ((engineered_r2 - baseline_r2) / baseline_r2) * 100
    rmse_improvement = ((baseline_rmse - engineered_rmse) / baseline_rmse) * 100
    print(f"\nR² Improvement: {r2_improvement:.2f}%")
    print(f"RMSE Improvement: {rmse_improvement:.2f}%")
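
The comparison table and the two improvement percentages can be sketched as one helper. The function name `compare_models` is hypothetical, and the metric values passed in below are placeholders chosen to make the arithmetic easy to follow. They are not real results.

```python
import pandas as pd


def compare_models(baseline_r2, baseline_rmse, engineered_r2, engineered_rmse):
    """Build a side-by-side table and relative improvements in percent."""
    table = pd.DataFrame(
        {
            "R2": [baseline_r2, engineered_r2],
            "RMSE": [baseline_rmse, engineered_rmse],
        },
        index=["Baseline", "Engineered"],
    )
    # Higher R² is better, so improvement is the relative increase
    r2_gain = (engineered_r2 - baseline_r2) / baseline_r2 * 100
    # Lower RMSE is better, so improvement is the relative decrease
    rmse_gain = (baseline_rmse - engineered_rmse) / baseline_rmse * 100
    return table, r2_gain, rmse_gain


# Placeholder inputs for illustration only
table, r2_gain, rmse_gain = compare_models(0.80, 30_000.0, 0.84, 27_000.0)
print(table)
print(f"R² improvement: {r2_gain:.2f}%")
print(f"RMSE improvement: {rmse_gain:.2f}%")
```

The sign conventions differ on purpose: a positive value means the engineered model did better on both metrics.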

### Visualize Feature Importances from Engineered Model

In [None]:
# Extract and visualize top 15 feature importances
# TODO: Get feature importances from engineered_model
# TODO: Create horizontal bar plot of top 15 features
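
For the analysis that follows, it helps to flag which of the top features are ones you engineered. Here is a minimal sketch on synthetic data, where `total_sf` stands in for an engineered feature. All names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not Ames)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, 200),
    "TotalBsmtSF": rng.uniform(0, 1500, 200),
})
X_demo["total_sf"] = X_demo["GrLivArea"] + X_demo["TotalBsmtSF"]
y_demo = 100 * X_demo["total_sf"] + rng.normal(0, 5_000, 200)
engineered = ["total_sf"]  # names of the engineered columns

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Top 15 importances, with a flag marking engineered features
top = pd.Series(model.feature_importances_, index=X_demo.columns).nlargest(15)
report = pd.DataFrame({
    "importance": top,
    "engineered": top.index.isin(engineered),
})
print(report)
print("Engineered features in top 15:", report.index[report["engineered"]].tolist())
```

The same `report` frame can feed a bar plot, for example by coloring bars according to the `engineered` column.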



### Analysis: Most Valuable Features

**Write 3-5 bullet points analyzing your results:**

- [Which of YOUR engineered features appeared in the top 15 most important features?]
- [Why do you think these specific features performed well?]
- [Were any engineered features less valuable than you expected? Why?]
- [What did you learn about feature engineering from this analysis?]
- [If you were to create more features, what would you try based on these results?]

---
## Step 6: Submit Your Work

Before submitting:
1. Make sure all code cells run without errors
2. Verify you have at least 5 engineered features with business justifications
3. Check that your comparison table and visualizations display correctly
4. Complete the analysis section above

Then push to GitHub:
```bash
git add .
git commit -m 'completed feature engineering assignment'
git push
```

Submit your GitHub repository link on the course platform.