# 🎨 Feature Engineering Playbook
## Making Your Data Better for Machine Learning!

### What is Feature Engineering? 🤔

Imagine you're making a pizza. You have ingredients (flour, cheese, tomatoes), but you need to **prepare** them first:
- Cut the tomatoes into slices
- Grate the cheese
- Mix and knead the dough

**Feature Engineering** is like preparing ingredients for your ML model! We take raw data and transform it into features that help our model learn better.

---

### What You'll Learn Today:
1. 🔍 Understanding Features
2. 🔢 Handling Missing Data
3. 🏷️ Encoding Categorical Variables
4. 📏 Feature Scaling
5. ✨ Creating New Features
6. 🎯 Feature Selection

Let's get started! 🚀

In [None]:
# Import our tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

# Make plots look nice
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")

---
## 1. 🔍 Understanding Features

**Features** are the characteristics we use to describe our data.

**Example:** Describing a video game character:
- **Height:** 180 cm (numerical)
- **Strength:** 85/100 (numerical)
- **Class:** Warrior (categorical)
- **Has_Magic:** Yes (binary)
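
As a small sketch of how these feature types look in pandas, the one-row `character` table below (made up for illustration) is split into numerical and text columns with `select_dtypes`:

```python
import pandas as pd

# Hypothetical one-row table describing the game character above
character = pd.DataFrame({
    "Height": [180],        # numerical
    "Strength": [85],       # numerical
    "Class": ["Warrior"],   # categorical
    "Has_Magic": ["Yes"],   # binary, stored as text
})

# Separate the columns that hold numbers from the columns that hold text
numerical_cols = character.select_dtypes(include="number").columns.tolist()
text_cols = character.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical/binary:", text_cols)
```

The text columns are the ones that will need encoding before a model can use them.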

Let's create a sample dataset about students!

In [None]:
# Creating a student dataset
data = {
    'Student_ID': range(1, 16),
    'Study_Hours': [2, 4, 1, 5, 3, np.nan, 4, 2, 5, 3, 4, 1, np.nan, 3, 5],
    'Sleep_Hours': [7, 8, 6, 7, np.nan, 8, 7, 6, 8, 7, 8, 5, 7, 6, 8],
    'Favorite_Subject': ['Math', 'Science', 'Math', 'English', 'Science', 'Math',
                         'English', 'Math', 'Science', 'English', 'Math', 'Science',
                         'English', 'Math', 'Science'],
    'Has_Pet': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No',
                'Yes', 'No', 'Yes', 'No', 'Yes'],
    'Test_Score': [85, 92, 78, 95, 88, 82, 90, 80, 96, 87, 91, 75, 89, 83, 94]
}

df = pd.DataFrame(data)
print("📊 Our Student Dataset:")
print(df)
print("\n📈 Dataset Info:")
df.info()  # info() prints its report directly and returns None

---
## 2. 🔢 Handling Missing Data

Sometimes data is missing - like when a student forgets to fill in a survey question!

**Common Strategies:**
1. **Fill with mean** - Average value
2. **Fill with median** - Middle value
3. **Fill with mode** - Most common value
4. **Remove the row** - Delete incomplete data
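
To make the four strategies concrete, here is a minimal sketch on a tiny made-up Series (`hours` is not part of the student dataset). It includes one unusually large value to show how the mean and the median can disagree:

```python
import numpy as np
import pandas as pd

# Made-up study hours with one missing value and one unusually large value
hours = pd.Series([1.0, 2.0, 2.0, 3.0, np.nan, 12.0])

filled_mean = hours.fillna(hours.mean())        # average of the known values
filled_median = hours.fillna(hours.median())    # middle of the known values
filled_mode = hours.fillna(hours.mode()[0])     # most common known value
dropped = hours.dropna()                        # remove the row instead of filling it

print("mean fill:  ", filled_mean[4])
print("median fill:", filled_median[4])
print("mode fill:  ", filled_mode[4])
print("rows left after dropping:", len(dropped))
```

The single large value (12) pulls the mean up to 4.0, while the median stays at 2.0. That is one reason the median is often preferred when a column has outliers.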

Let's see the missing values in our data:

In [None]:
# Check for missing values
print("❌ Missing Values:")
print(df.isnull().sum())

# Visualize missing data
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='coolwarm', yticklabels=False)
plt.title('Missing Data Visualization (Red = Missing)', fontsize=14, fontweight='bold')  # coolwarm draws True (missing) in red
plt.tight_layout()
plt.show()

In [None]:
# Strategy 1: Fill missing values with MEAN
imputer_mean = SimpleImputer(strategy='mean')

# Create a copy of our dataframe
df_filled = df.copy()

# Fill missing Study_Hours with mean
df_filled[['Study_Hours']] = imputer_mean.fit_transform(df[['Study_Hours']])

# Fill missing Sleep_Hours with mean
df_filled[['Sleep_Hours']] = imputer_mean.fit_transform(df[['Sleep_Hours']])

print("✅ After filling missing values with MEAN:")
print(df_filled[['Study_Hours', 'Sleep_Hours']])

print("\n🎯 No more missing values!")
print(df_filled.isnull().sum())

### 🎮 Try It Yourself!

**Challenge:** Try filling missing values with `median` instead of `mean`. What's the difference?

In [None]:
# YOUR TURN: Fill with median
imputer_median = SimpleImputer(strategy='median')

# TODO: Fill missing values with median
# df_median = ...


---
## 3. 🏷️ Encoding Categorical Variables

Machine learning models speak the language of **numbers**, not words!

We need to convert categories (like "Math", "Science") into numbers.

### Two Main Techniques:

#### A) **Label Encoding** - Assign numbers
- English → 0
- Math → 1
- Science → 2

(scikit-learn's `LabelEncoder` hands out numbers in alphabetical order.)

#### B) **One-Hot Encoding** - Create binary columns
- Is_Math: 1 or 0
- Is_Science: 1 or 0
- Is_English: 1 or 0
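
Here is a minimal sketch of both techniques on a short made-up list of subjects. `LabelEncoder` assigns integers in sorted (alphabetical) order, so the number each subject receives depends on its name, not on the order it appears in the data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

subjects = pd.Series(["Math", "Science", "English", "Math"])  # made-up values

# Label encoding: one integer per category, assigned in sorted order
encoder = LabelEncoder()
labels = encoder.fit_transform(subjects)
print(encoder.classes_)  # position in this array = assigned number
print(labels)

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(subjects, prefix="Is", dtype=int)
print(one_hot)
```

Label encoding suggests an order (0 < 1 < 2) that subjects do not really have, which is why one-hot encoding is usually the safer choice for categories like these.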

In [None]:
# Label Encoding for Has_Pet (Binary: Yes/No)
label_encoder = LabelEncoder()
df_filled['Has_Pet_Encoded'] = label_encoder.fit_transform(df_filled['Has_Pet'])

print("🏷️ Label Encoding for Has_Pet:")
print(df_filled[['Has_Pet', 'Has_Pet_Encoded']].head(10))
print("\nEncoding: No = 0, Yes = 1")

In [None]:
# One-Hot Encoding for Favorite_Subject
subject_encoded = pd.get_dummies(df_filled['Favorite_Subject'], prefix='Subject', dtype=int)  # dtype=int gives 1/0 instead of True/False

print("🎯 One-Hot Encoding for Favorite_Subject:")
print(subject_encoded.head(10))

# Add to our dataframe
df_filled = pd.concat([df_filled, subject_encoded], axis=1)
print("\n✅ Updated DataFrame with encoded columns:")
print(df_filled.head())

In [None]:
# Visualize the encoding
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original categories
df_filled['Favorite_Subject'].value_counts().plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Original Categories', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Subject')

# One-hot encoded
subject_encoded.sum().plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('One-Hot Encoded Columns', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Count')
axes[1].set_xlabel('Encoded Column')

plt.tight_layout()
plt.show()

---
## 4. 📏 Feature Scaling

Imagine comparing:
- **Student's height:** 150 cm
- **Student's age:** 13 years

The numbers are on different scales! Many ML models (especially distance-based ones) treat bigger numbers as more important, so height would drown out age.

**Solution:** Put all features on the same scale!
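
To see why scale matters, here is a small sketch with three made-up students described by `[height in cm, age in years]`. It measures the straight-line (Euclidean) distance that distance-based models rely on:

```python
import numpy as np

# Made-up students: [height in cm, age in years]
ana = np.array([150.0, 13.0])
ben = np.array([160.0, 13.0])   # 10 cm taller, same age
cam = np.array([150.0, 18.0])   # same height, 5 years older

dist_height_gap = np.linalg.norm(ana - ben)
dist_age_gap = np.linalg.norm(ana - cam)

print("Distance caused by a 10 cm height gap:", dist_height_gap)
print("Distance caused by a 5 year age gap:  ", dist_age_gap)
```

A 5-year age gap is huge for students, yet it counts for less than a 10 cm height gap simply because centimetres produce bigger numbers.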

### Two Popular Methods:

#### 1) **Standardization (Z-score)** - Centers around 0
Formula: `(x - mean) / standard_deviation`

#### 2) **Normalization (Min-Max)** - Scales to 0-1
Formula: `(x - min) / (max - min)`
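
As a quick check of the two formulas, this sketch applies them by hand to four made-up values and compares the results with scikit-learn's scalers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0]])  # made-up values, one feature

# Standardization by hand: (x - mean) / standard_deviation
z_manual = (x - x.mean()) / x.std()          # NumPy's std() is the population std
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization by hand: (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Normalization matches:  ", np.allclose(mm_manual, mm_sklearn))
print("Normalized values:", mm_manual.ravel())
```

The smallest value maps to 0 and the largest to 1 under min-max scaling, while standardized values are centered on 0.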

In [None]:
# Let's look at our numerical features
numerical_features = ['Study_Hours', 'Sleep_Hours', 'Test_Score']

print("📊 Original Values:")
print(df_filled[numerical_features].describe())

# Visualize original distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, col in enumerate(numerical_features):
    axes[idx].hist(df_filled[col], bins=10, color='lightblue', edgecolor='black')
    axes[idx].set_title(f'Original {col}', fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
plt.tight_layout()
plt.show()

In [None]:
# Standardization (StandardScaler)
scaler_standard = StandardScaler()
df_standardized = df_filled.copy()
df_standardized[numerical_features] = scaler_standard.fit_transform(df_filled[numerical_features])

# describe() reports the sample std (n-1), so it prints a value slightly above 1
print("📏 After Standardization (mean≈0, std≈1):")
print(df_standardized[numerical_features].describe())

# Normalization (MinMaxScaler)
scaler_minmax = MinMaxScaler()
df_normalized = df_filled.copy()
df_normalized[numerical_features] = scaler_minmax.fit_transform(df_filled[numerical_features])

print("\n🎯 After Normalization (range: 0-1):")
print(df_normalized[numerical_features].describe())

In [None]:
# Compare all three side by side
fig, axes = plt.subplots(3, 3, figsize=(15, 12))

for idx, col in enumerate(numerical_features):
    # Original
    axes[0, idx].hist(df_filled[col], bins=10, color='lightblue', edgecolor='black')
    axes[0, idx].set_title(f'Original {col}', fontweight='bold')

    # Standardized
    axes[1, idx].hist(df_standardized[col], bins=10, color='lightgreen', edgecolor='black')
    axes[1, idx].set_title(f'Standardized {col}', fontweight='bold')

    # Normalized
    axes[2, idx].hist(df_normalized[col], bins=10, color='lightcoral', edgecolor='black')
    axes[2, idx].set_title(f'Normalized {col}', fontweight='bold')

plt.tight_layout()
plt.show()

---
## 5. ✨ Creating New Features

Sometimes we can **create** better features by combining existing ones!

**Examples:**
- **Total_Hours** = Study_Hours + Sleep_Hours
- **Study_Sleep_Ratio** = Study_Hours / Sleep_Hours
- **Is_High_Scorer** = 1 if Test_Score >= 90, else 0

In [None]:
# Create new features
df_features = df_filled.copy()

# 1. Total Hours (combination)
df_features['Total_Hours'] = df_features['Study_Hours'] + df_features['Sleep_Hours']

# 2. Study-Sleep Ratio
df_features['Study_Sleep_Ratio'] = df_features['Study_Hours'] / df_features['Sleep_Hours']

# 3. Binary feature: Is High Scorer?
df_features['Is_High_Scorer'] = (df_features['Test_Score'] >= 90).astype(int)

# 4. Study Hours Squared (polynomial feature)
df_features['Study_Hours_Squared'] = df_features['Study_Hours'] ** 2

print("✨ New Features Created:")
print(df_features[['Study_Hours', 'Sleep_Hours', 'Total_Hours',
                    'Study_Sleep_Ratio', 'Test_Score', 'Is_High_Scorer',
                    'Study_Hours_Squared']].head(10))

In [None]:
# Visualize the relationship between new features and test scores
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Total Hours vs Test Score
axes[0, 0].scatter(df_features['Total_Hours'], df_features['Test_Score'],
                   c='blue', alpha=0.6, s=100)
axes[0, 0].set_xlabel('Total Hours (Study + Sleep)', fontweight='bold')
axes[0, 0].set_ylabel('Test Score', fontweight='bold')
axes[0, 0].set_title('Total Hours vs Test Score', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Study-Sleep Ratio vs Test Score
axes[0, 1].scatter(df_features['Study_Sleep_Ratio'], df_features['Test_Score'],
                   c='green', alpha=0.6, s=100)
axes[0, 1].set_xlabel('Study-Sleep Ratio', fontweight='bold')
axes[0, 1].set_ylabel('Test Score', fontweight='bold')
axes[0, 1].set_title('Study-Sleep Ratio vs Test Score', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# High Scorers Distribution
# sort_index() keeps the bars in 0, 1 order so they match the tick labels below
df_features['Is_High_Scorer'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 0],
                                                    color=['coral', 'lightblue'])
axes[1, 0].set_xlabel('Is High Scorer (0=No, 1=Yes)', fontweight='bold')
axes[1, 0].set_ylabel('Count', fontweight='bold')
axes[1, 0].set_title('Distribution of High Scorers', fontsize=12, fontweight='bold')
axes[1, 0].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)

# Study Hours Squared vs Test Score
axes[1, 1].scatter(df_features['Study_Hours_Squared'], df_features['Test_Score'],
                   c='purple', alpha=0.6, s=100)
axes[1, 1].set_xlabel('Study Hours Squared', fontweight='bold')
axes[1, 1].set_ylabel('Test Score', fontweight='bold')
axes[1, 1].set_title('Study Hours² vs Test Score', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🧮 Polynomial Features

Sometimes relationships aren't linear (straight lines)!

Polynomial features help capture **curved** relationships.

In [None]:
# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)

# Use just Study_Hours and Sleep_Hours
features_for_poly = df_filled[['Study_Hours', 'Sleep_Hours']]
poly_features = poly.fit_transform(features_for_poly)

# Get feature names
poly_feature_names = poly.get_feature_names_out(['Study_Hours', 'Sleep_Hours'])

print("🧮 Polynomial Features (degree=2):")
print(f"Original features: {list(features_for_poly.columns)}")
print(f"\nNew polynomial features: {list(poly_feature_names)}")
print(f"\nNumber of features increased from {features_for_poly.shape[1]} to {poly_features.shape[1]}!")

# Show first few rows
poly_df = pd.DataFrame(poly_features, columns=poly_feature_names)
print("\n📊 Sample Polynomial Features:")
print(poly_df.head())

---
## 6. 🎯 Feature Selection

**Too many features?** Not all features are helpful!

Feature selection helps us pick the **most important** features.

**Why?**
- Faster training
- Often better performance (less noise for the model to memorize)
- Easier to understand
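
As a rough sketch of how score-based selection works, the example below builds two made-up features: `useful` shifts with the class label and `noise` does not. `SelectKBest` with `f_classif` (an ANOVA F-test) should then score the first one much higher:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)

# Made-up data: 'useful' depends on the label, 'noise' is unrelated to it
useful = labels * 3.0 + rng.normal(size=40)
noise = rng.normal(size=40)
X_demo = np.column_stack([useful, noise])

# Keep only the single best-scoring feature
selector_demo = SelectKBest(score_func=f_classif, k=1).fit(X_demo, labels)

print("F-scores (useful, noise):", selector_demo.scores_)
print("Kept (useful, noise):    ", selector_demo.get_support())
```

A higher F-score means the feature's values differ more between the classes than within them.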

Let's find which features best predict test scores!

In [None]:
# Prepare features for selection
# We'll use numerical and encoded features
feature_columns = ['Study_Hours', 'Sleep_Hours', 'Has_Pet_Encoded',
                   'Subject_English', 'Subject_Math', 'Subject_Science']

X = df_filled[feature_columns]
y = df_filled['Test_Score']

print("📋 Our Features:")
print(X.head())
print("\n🎯 Target (What we're predicting):")
print(y.head())

In [None]:
# Convert to binary classification for feature selection
# High scorer = 1, Not high scorer = 0
y_binary = (y >= 90).astype(int)

# Select K Best features using f_classif
selector = SelectKBest(score_func=f_classif, k=3)  # Select top 3 features
X_selected = selector.fit_transform(X, y_binary)

# Get feature scores
feature_scores = pd.DataFrame({
    'Feature': feature_columns,
    'Score': selector.scores_
}).sort_values('Score', ascending=False)

print("🏆 Feature Importance Scores:")
print(feature_scores)

# Get selected features
selected_features = [feature_columns[i] for i in selector.get_support(indices=True)]
print(f"\n✅ Top {selector.k} Selected Features: {selected_features}")

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 6))
colors = ['gold' if i < 3 else 'lightgray' for i in range(len(feature_scores))]
plt.barh(feature_scores['Feature'], feature_scores['Score'], color=colors)
plt.xlabel('Importance Score', fontsize=12, fontweight='bold')
plt.ylabel('Features', fontsize=12, fontweight='bold')
plt.title('Feature Importance for Predicting High Test Scores',
          fontsize=14, fontweight='bold')
plt.axvline(x=feature_scores['Score'].iloc[2], color='red',
            linestyle='--', label='Selection Threshold')
plt.legend()
plt.tight_layout()
plt.show()

print("\n💡 Gold bars = Selected features!")
print("Gray bars = Not selected (less important)")

---
## 7. 🤖 Putting It All Together: Build a Model!

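Let's see whether feature engineering improves our model! With only 15 students, the test set holds just 5 rows, so treat the accuracy numbers as a rough illustration.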

We'll compare:
1. **Raw data** (no feature engineering)
2. **Engineered features** (with all our transformations)
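
One detail worth watching in a comparison like this: the scaler should learn its mean and standard deviation from the training rows only, so the test rows stay truly unseen. Here is a minimal sketch of that idea using a scikit-learn pipeline on made-up data (`X_demo` and `y_demo` are illustrative, not the student dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(60, 2))    # made-up features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 10).astype(int)  # made-up target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# fit() scales using statistics from the training rows only;
# score() reuses those same statistics on the test rows.
model_demo = make_pipeline(StandardScaler(), LogisticRegression())
model_demo.fit(X_tr, y_tr)
test_accuracy = model_demo.score(X_te, y_te)

print(f"Test accuracy: {test_accuracy:.2%}")
```

Bundling the scaler and the model together means the same steps are applied in the same order every time, which makes it harder to leak test information by accident.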

In [None]:
# Prepare data for modeling
# Model 1: Using RAW features (just Study_Hours and Sleep_Hours)
X_raw = df_filled[['Study_Hours', 'Sleep_Hours']].copy()
y_target = (df_filled['Test_Score'] >= 90).astype(int)  # Binary: High scorer or not

# Split data
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y_target, test_size=0.3, random_state=42
)

# Train model with raw features
model_raw = LogisticRegression(random_state=42)
model_raw.fit(X_train_raw, y_train)
y_pred_raw = model_raw.predict(X_test_raw)
accuracy_raw = accuracy_score(y_test, y_pred_raw)

print("📊 Model 1: Using RAW Features")
print(f"Features used: {list(X_raw.columns)}")
print(f"Accuracy: {accuracy_raw:.2%}")

In [None]:
# Model 2: Using ENGINEERED features
# Create engineered features
X_engineered = df_filled[['Study_Hours', 'Sleep_Hours']].copy()

# Add new features
X_engineered['Total_Hours'] = X_engineered['Study_Hours'] + X_engineered['Sleep_Hours']
X_engineered['Study_Sleep_Ratio'] = X_engineered['Study_Hours'] / X_engineered['Sleep_Hours']
X_engineered['Study_Hours_Squared'] = X_engineered['Study_Hours'] ** 2

# Add encoded categorical features
X_engineered['Has_Pet'] = df_filled['Has_Pet_Encoded']
X_engineered = pd.concat([X_engineered, subject_encoded], axis=1)

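# Split data first so the test rows stay unseen while scaling
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_engineered, y_target, test_size=0.3, random_state=42
)

# Scale features: fit the scaler on the training rows, then apply it to both sets
scaler = StandardScaler()
X_train_eng = pd.DataFrame(scaler.fit_transform(X_train_eng), columns=X_engineered.columns)
X_test_eng = pd.DataFrame(scaler.transform(X_test_eng), columns=X_engineered.columns)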

# Train model with engineered features
model_engineered = LogisticRegression(random_state=42, max_iter=1000)
model_engineered.fit(X_train_eng, y_train_eng)
y_pred_eng = model_engineered.predict(X_test_eng)
accuracy_eng = accuracy_score(y_test_eng, y_pred_eng)

print("\n✨ Model 2: Using ENGINEERED Features")
print(f"Features used: {list(X_engineered.columns)}")
print(f"Accuracy: {accuracy_eng:.2%}")

In [None]:
# Compare the two models
comparison = pd.DataFrame({
    'Model': ['Raw Features', 'Engineered Features'],
    'Number of Features': [X_raw.shape[1], X_engineered.shape[1]],
    'Accuracy': [accuracy_raw, accuracy_eng],
    'Improvement': [0, accuracy_eng - accuracy_raw]
})

print("\n📊 Model Comparison:")
print(comparison)

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
models = ['Raw Features', 'Engineered Features']
accuracies = [accuracy_raw, accuracy_eng]
colors_bar = ['lightcoral', 'lightgreen']

axes[0].bar(models, accuracies, color=colors_bar, edgecolor='black', linewidth=2)
axes[0].set_ylabel('Accuracy', fontsize=12, fontweight='bold')
axes[0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylim([0, 1.1])
for i, v in enumerate(accuracies):
    axes[0].text(i, v + 0.02, f'{v:.2%}', ha='center', fontweight='bold', fontsize=12)

# Feature count comparison
feature_counts = [X_raw.shape[1], X_engineered.shape[1]]
axes[1].bar(models, feature_counts, color=colors_bar, edgecolor='black', linewidth=2)
axes[1].set_ylabel('Number of Features', fontsize=12, fontweight='bold')
axes[1].set_title('Feature Count Comparison', fontsize=14, fontweight='bold')
for i, v in enumerate(feature_counts):
    axes[1].text(i, v + 0.2, str(v), ha='center', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

print("\n🎉 Feature Engineering can improve model performance!")

---
## 8. 🎓 Summary: Your Feature Engineering Toolkit

### What We Learned:

| Technique | What It Does | When To Use |
|-----------|--------------|-------------|
| **Handling Missing Data** | Fill or remove missing values | When you have incomplete data |
| **Label Encoding** | Convert categories to numbers | For binary or ordinal categories |
| **One-Hot Encoding** | Create binary columns for categories | For nominal categories |
| **Standardization** | Scale features to mean=0, std=1 | When features have different units |
| **Normalization** | Scale features to 0-1 range | When you want bounded values |
| **Feature Creation** | Combine or transform features | To capture new patterns |
| **Polynomial Features** | Create squared and interaction terms | For non-linear relationships |
| **Feature Selection** | Pick most important features | To reduce complexity |

### 🔑 Key Takeaways:

1. **Feature Engineering is an art AND a science** - experiment with different transformations!
2. **More features ≠ Better model** - quality over quantity!
3. **Always scale your features** when using distance-based algorithms
4. **Understand your data first** - look at distributions, missing values, correlations
5. **Test different approaches** - compare raw vs engineered features

### 🚀 Next Steps:

1. Try these techniques on your own datasets!
2. Experiment with creating domain-specific features
3. Learn about advanced techniques like:
   - Target encoding
   - Feature hashing
   - Dimensionality reduction (PCA)
   - Automatic feature engineering tools

---
## 🎮 Practice Challenges

### Challenge 1: Create Your Own Features
Create 3 new features from the student dataset and test if they improve the model!

### Challenge 2: Different Scalers
Compare StandardScaler vs MinMaxScaler vs RobustScaler. Which works best?

### Challenge 3: Feature Engineering Pipeline
Build a complete feature engineering pipeline that:
1. Handles missing values
2. Encodes categories
3. Creates new features
4. Scales data
5. Selects best features

### Challenge 4: Real Dataset
Apply these techniques to a real dataset (like Titanic, House Prices, or Iris)!

In [None]:
# YOUR PRACTICE SPACE - Try the challenges here!

# Challenge 1: Create your own features


# Challenge 2: Compare different scalers


# Challenge 3: Build a complete pipeline


# Challenge 4: Try on a real dataset


---
## 📚 Additional Resources

### Learn More:
- [Scikit-learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Feature Engineering for Machine Learning](https://www.kaggle.com/learn/feature-engineering)
- [Pandas Documentation](https://pandas.pydata.org/docs/)

### Practice Datasets:
- [Kaggle Datasets](https://www.kaggle.com/datasets)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Scikit-learn Toy Datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

---

## 🎉 Congratulations!

You've completed the Feature Engineering Playbook! You now know how to:
- ✅ Handle missing data
- ✅ Encode categorical variables
- ✅ Scale features
- ✅ Create new features
- ✅ Select important features
- ✅ Build better ML models

**Keep practicing and experimenting! 🚀**

---

*Created with ❤️ for middle school ML students*