# Lesson 15 - Feature Engineering

## Purpose and Goals
Feature engineering is often the most consequential step in data preparation. You will learn to transform raw data (such as text or timestamps) into useful numerical features. The goal is to master techniques for handling date/time data and for converting non-numerical variables into a format suitable for both statistical analysis and machine learning.

## Learning Objectives
- Understand what feature engineering is and why it often matters more than the choice of ML algorithm
- Master handling categorical data using One-Hot Encoding and Label Encoding
- Learn to extract meaningful features from date/time data
- Create derived features from existing variables
- Apply text feature engineering techniques


## Pre-class Reading

### Short readings:
- **The Art of Feature Engineering**: [Feature Engineering Guide](https://machinelearningmastery.com/the-concise-guide-to-feature-engineering-for-better-model-performance/)
- **Handling Categorical Data**: [Encoding Categorical Variables](https://medium.com/@jaberi.mohamedhabib/encoding-categorical-variables-methods-and-techniques-in-pandas-scikit-learn-and-using-dummy-216ae2d5128d)

### Deeper dive / Longer readings:
- **Date and Time Features**: [Working with Date and Time using Pandas](https://www.geeksforgeeks.org/python/python-working-with-date-and-time-using-pandas/)
- **Text Feature Basics**: [Feature Engineering in NLP](https://www.analyticsvidhya.com/blog/2021/04/a-guide-to-feature-engineering-in-nlp/)

### Videos:
- Feature Engineering Examples: Practical, real-world examples of creating new features from raw data
- Encoding Categorical Variables: Visual tutorial explaining the difference between One-Hot and Label Encoding


## Classroom Activities

### Activity 1: Discussion - Raw Data to Insight (15 min)
Start with a discussion on how a raw data column (e.g., a timestamp) can contain several valuable features.

**Questions:**
- "If you have a column named Registration_Date, what are three new features you could create from it?"
- "Why must we convert a categorical column like 'Color' into numbers before performing statistical analysis or training a model?"

### Activity 2: Lab - Time and Categorical Transformation
Students will work with a dataset containing text and date columns to practice key transformation methods.


In [24]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime


In [61]:
# 1. Create Data with Raw Features
data = {'Registration_Time': ['10:30:00 20251001','10:30:00 20251002', '2025-10-05 14:00:00', '2025-10-17 09:15:00'],
        'Product': ['Book', "Ebook" , 'Ebook', 'Book'],
        'Title': ['War and Peace','War and Peace', 'Gatsby', 'Moby Dick']}
df_raw = pd.DataFrame(data)

print("Original DataFrame:")
print(df_raw)
print(f"\nDataFrame shape: {df_raw.shape}")
print(f"Data types:\n{df_raw.dtypes}")


Original DataFrame:
     Registration_Time Product          Title
0    10:30:00 20251001    Book  War and Peace
1    10:30:00 20251002   Ebook  War and Peace
2  2025-10-05 14:00:00   Ebook         Gatsby
3  2025-10-17 09:15:00    Book      Moby Dick

DataFrame shape: (4, 3)
Data types:
Registration_Time    object
Product              object
Title                object
dtype: object


In [51]:
# info() shows each column's dtype together with its non-null count;
# df_raw.dtypes (used above) shows the dtypes alone

df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Registration_Time  3 non-null      object
 1   Product            3 non-null      object
 2   Title              3 non-null      object
dtypes: object(3)
memory usage: 200.0+ bytes


### Time Feature Engineering


In [52]:
# pd.to_datetime() converts date-like strings into real datetime64 values,
# which pandas can sort, compare and do date arithmetic on.
# format="mixed" (pandas >= 2.0) infers the format of each element separately.

registration_time = pd.to_datetime(df_raw["Registration_Time"], format="mixed")
print(registration_time)

weekday = registration_time.dt.dayofweek
print(weekday)

# Once the column is appropriately typed we can explore methods on it
# registration_time.dt


0   2025-10-01 10:30:00
1   2025-10-05 14:00:00
2   2025-10-17 09:15:00
Name: Registration_Time, dtype: datetime64[ns]
0    2
1    6
2    4
Name: Registration_Time, dtype: int32
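
The raw `Registration_Time` strings come in two layouts, and `format="mixed"` leaves pandas to guess each one. When the layouts are known, an alternative is to parse each layout explicitly and combine the results. The following is a minimal sketch under that assumption; the two format strings are inferred from the sample values and are not part of the original lab.

```python
import pandas as pd

# Same raw strings as in the lab data: two different layouts in one column
raw = pd.Series(
    ["10:30:00 20251001", "10:30:00 20251002",
     "2025-10-05 14:00:00", "2025-10-17 09:15:00"],
    name="Registration_Time",
)

# Try each known layout; rows that do not match become NaT instead of raising
iso_layout = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
time_first = pd.to_datetime(raw, format="%H:%M:%S %Y%m%d", errors="coerce")

# Take the ISO-style result where it exists, otherwise the time-first result
parsed = iso_layout.fillna(time_first)

print(parsed)
# Any remaining NaT marks a string that matched neither layout
print("Unparsed rows:", parsed.isna().sum())
```

Explicit formats make a malformed value show up as `NaT` rather than being silently interpreted in an unexpected way.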


In [53]:
# 2. Time Feature Engineering
df_raw['Registration_Time'] = pd.to_datetime(df_raw['Registration_Time'], format="mixed")
df_raw['DayOfWeek'] = df_raw['Registration_Time'].dt.dayofweek  # 0=Monday, 6=Sunday
df_raw['Hour'] = df_raw['Registration_Time'].dt.hour
df_raw['Month'] = df_raw['Registration_Time'].dt.month
df_raw['Year'] = df_raw['Registration_Time'].dt.year
df_raw['DayOfYear'] = df_raw['Registration_Time'].dt.dayofyear

print("DataFrame with Time Features:")
print(df_raw)


DataFrame with Time Features:
    Registration_Time Product          Title  DayOfWeek  Hour  Month  Year  \
0 2025-10-01 10:30:00    Book  War and Peace          2    10     10  2025   
1 2025-10-05 14:00:00   Ebook         Gatsby          6    14     10  2025   
2 2025-10-17 09:15:00    Book      Moby Dick          4     9     10  2025   

   DayOfYear  
0        274  
1        278  
2        290  
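
The `.dt` accessor offers more than the components extracted above. As a small sketch, the example below pulls a few additional calendar features from the same three timestamps shown in the output; the column names `Quarter`, `DayName` and `IsMonthStart` are chosen here for illustration.

```python
import pandas as pd

# Timestamps copied from the lab output above
ts = pd.Series(pd.to_datetime([
    "2025-10-01 10:30:00",
    "2025-10-05 14:00:00",
    "2025-10-17 09:15:00",
]))

features = pd.DataFrame({
    "Quarter": ts.dt.quarter,                          # 1-4
    "DayName": ts.dt.day_name(),                       # e.g. "Wednesday"
    "IsMonthStart": ts.dt.is_month_start.astype(int),  # 1 on the first day of a month
})

print(features)
```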


### Text Feature Engineering


In [None]:
# 3. Text Feature Engineering

# For text features, it can often be useful to create features based on string properties
# Features like:
# - Length of the string
# - Number of words
# - Presence of specific keywords
# - Whether it starts with a capital letter
df_raw['Title_Length'] = df_raw['Title'].str.len()
df_raw['Word_Count'] = df_raw['Title'].str.split().str.len()
df_raw['Has_And'] = df_raw['Title'].str.contains('and', case=False).astype(int)
df_raw['Starts_With_Capital'] = df_raw['Title'].str[0].str.isupper().astype(int)

print("DataFrame with Text Features:")
print(df_raw)


DataFrame with Text Features:
    Registration_Time Product          Title  DayOfWeek  Hour  Month  Year  \
0 2025-10-01 10:30:00    Book  War and Peace          2    10     10  2025   
1 2025-10-05 14:00:00   Ebook         Gatsby          6    14     10  2025   
2 2025-10-17 09:15:00    Book      Moby Dick          4     9     10  2025   

   DayOfYear  Title_Length  Word_Count  Has_And  Starts_With_Capital  
0        274            13           3        1                    1  
1        278             6           1        0                    1  
2        290             9           2        0                    1  
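
One caveat with keyword features: `str.contains('and')` matches the substring anywhere, including inside longer words. If the intent is to detect the word "and" itself, a regular expression with word boundaries is one option. The sketch below compares the two; "The Sandman" is an extra title added purely for illustration and is not part of the lab data.

```python
import pandas as pd

# "The Sandman" is an illustrative addition, not from the lab dataset
titles = pd.Series(["War and Peace", "The Sandman", "Gatsby"])

# Substring match: also fires on "Sandman"
substring_hit = titles.str.contains("and", case=False).astype(int)

# Whole-word match: \b marks a word boundary on each side of "and"
whole_word_hit = titles.str.contains(r"\band\b", case=False, regex=True).astype(int)

print(pd.DataFrame({
    "Title": titles,
    "Has_And_Substring": substring_hit,
    "Has_And_Word": whole_word_hit,
}))
```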


### Categorical Encoding


In [55]:
# 4. Categorical Encoding (One-Hot)
print("Before One-Hot Encoding:")
print(f"Product column unique values: {df_raw['Product'].unique()}")

# One-Hot Encoding
product_dummies = pd.get_dummies(df_raw['Product'], prefix='Product')
print(f"\nOne-Hot Encoded Product column:")
print(product_dummies)

# Combine with original dataframe
df_processed = pd.concat([df_raw.drop('Product', axis=1), product_dummies], axis=1)
print(f"\nDataFrame after One-Hot Encoding:")
print(df_processed)


Before One-Hot Encoding:
Product column unique values: ['Book' 'Ebook']

One-Hot Encoded Product column:
   Product_Book  Product_Ebook
0          True          False
1         False           True
2          True          False

DataFrame after One-Hot Encoding:
    Registration_Time          Title  DayOfWeek  Hour  Month  Year  DayOfYear  \
0 2025-10-01 10:30:00  War and Peace          2    10     10  2025        274   
1 2025-10-05 14:00:00         Gatsby          6    14     10  2025        278   
2 2025-10-17 09:15:00      Moby Dick          4     9     10  2025        290   

   Title_Length  Word_Count  Has_And  Starts_With_Capital  Product_Book  \
0            13           3        1                    1          True   
1             6           1        0                    1         False   
2             9           2        0                    1          True   

   Product_Ebook  
0          False  
1           True  
2          False  
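
The encode-then-concat steps above can also be done in one call by passing `columns=` to `pd.get_dummies`. The sketch below additionally uses `dtype=int` to get 0/1 instead of True/False, and `drop_first=True` to drop one redundant column, which some linear models prefer. The small DataFrame is a stand-in that mirrors the lab's `Product` and `Title` columns.

```python
import pandas as pd

# Stand-in for the lab DataFrame, reduced to two columns
df = pd.DataFrame({
    "Product": ["Book", "Ebook", "Book"],
    "Title": ["War and Peace", "Gatsby", "Moby Dick"],
})

# Encode in place: the Product column is replaced by its dummy columns
encoded = pd.get_dummies(df, columns=["Product"], dtype=int)

# drop_first removes the first category; it is implied when all others are 0
reduced = pd.get_dummies(df, columns=["Product"], dtype=int, drop_first=True)

print(encoded)
print(reduced)
```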


In [56]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [57]:
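# Alternative: Label Encoding (one integer per category).
# LabelEncoder assigns codes in sorted label order, so for truly ordinal
# categories, check that the codes match the intended ranking.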
from sklearn.preprocessing import LabelEncoder

# Create a copy for label encoding example
df_label = df_raw.copy()

# Label Encoding
le = LabelEncoder()
df_label['Product_Label'] = le.fit_transform(df_label['Product'])

print("Label Encoding Example:")
print(f"Original Product values: {df_label['Product'].unique()}")
print(f"Encoded values: {df_label['Product_Label'].unique()}")
print(f"Label mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")
print(f"\nDataFrame with Label Encoding:")
print(df_label[['Product', 'Product_Label']])


Label Encoding Example:
Original Product values: ['Book' 'Ebook']
Encoded values: [0 1]
Label mapping: {'Book': np.int64(0), 'Ebook': np.int64(1)}

DataFrame with Label Encoding:
  Product  Product_Label
0    Book              0
1   Ebook              1
2    Book              0
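
`Product` has no natural ranking, so the integer codes above are arbitrary. For a genuinely ordinal variable, the order usually has to be stated explicitly, because sorted label order rarely matches it. The sketch below does this with an ordered pandas categorical; the `Size` column and its three levels are hypothetical and do not appear in the lab data.

```python
import pandas as pd

# Hypothetical ordinal column (not part of the lab dataset)
sizes = pd.Series(["Medium", "Small", "Large", "Small"], name="Size")

# State the ranking explicitly, lowest to highest
size_order = ["Small", "Medium", "Large"]
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives each value's position in size_order
size_code = pd.Series(ordered.codes, index=sizes.index, name="Size_Code")

print(pd.DataFrame({"Size": sizes, "Size_Code": size_code}))
# Alphabetical order would instead rank Large < Medium < Small
print(sorted(sizes.unique()))
```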


### Creating Derived Features


In [58]:
# 5. Creating Derived Features
# Example: Creating a weekend indicator
df_processed['Is_Weekend'] = (df_processed['DayOfWeek'] >= 5).astype(int)

# Example: Creating a business hours indicator
# (Hour 9 through 17 inclusive, so 17:00-17:59 also counts)
df_processed['Is_Business_Hours'] = ((df_processed['Hour'] >= 9) & (df_processed['Hour'] <= 17)).astype(int)

# Example: Creating a complex feature combining multiple variables
df_processed['Title_Complexity'] = df_processed['Title_Length'] * df_processed['Word_Count']

print("DataFrame with Derived Features:")
print(df_processed[['DayOfWeek', 'Is_Weekend', 'Hour', 'Is_Business_Hours', 'Title_Length', 'Word_Count', 'Title_Complexity']])


DataFrame with Derived Features:
   DayOfWeek  Is_Weekend  Hour  Is_Business_Hours  Title_Length  Word_Count  \
0          2           0    10                  1            13           3   
1          6           1    14                  1             6           1   
2          4           0     9                  1             9           2   

   Title_Complexity  
0                39  
1                 6  
2                18  
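
Two other common kinds of derived feature are bins and ratios. As a sketch, the example below groups `Hour` into parts of the day with `pd.cut` and computes characters per word from the two text features. The bin edges, the labels and the `Avg_Word_Length` name are assumptions made for illustration; the input values are copied from the lab output.

```python
import pandas as pd

# Values copied from the lab output above
df = pd.DataFrame({
    "Hour": [10, 14, 9],
    "Title_Length": [13, 6, 9],
    "Word_Count": [3, 1, 2],
})

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24)
df["Part_Of_Day"] = pd.cut(
    df["Hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["Night", "Morning", "Afternoon", "Evening"],
)

# Ratio feature: characters per word (Title_Length counts spaces too)
df["Avg_Word_Length"] = df["Title_Length"] / df["Word_Count"]

print(df)
```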


## Assignment
**Complete `L15_assignment.py`** - This assignment provides hands-on practice with feature engineering using an e-commerce dataset, including:
- Time feature engineering from order dates
- Text feature engineering from product names and addresses
- Categorical encoding strategies (One-Hot and Label Encoding)
- Derived features from existing variables
- Advanced feature engineering techniques
- Feature validation and documentation


### Activity 3: Hands-on - Feature Ideation (30 min)
Students select 3 features in their project dataset that are currently in a raw state (e.g., a text description, a date column, or two related numerical columns) and implement a strategy to transform each of them into predictive features.

**Exercise:**
1. Identify 3 raw features in your dataset
2. For each feature, propose 2-3 new engineered features
3. Implement the transformations
4. Document the reasoning behind each feature

## Key Takeaways

1. **Feature Engineering is Crucial**: Often matters more than the ML algorithm choice
2. **Time Features**: Extract meaningful components (day, hour, month, etc.) from datetime columns
3. **Text Features**: Create numerical features from text (length, word count, patterns)
4. **Categorical Encoding**: 
   - One-Hot Encoding for nominal categories
   - Label Encoding for ordinal categories, provided the integer codes follow the intended order
5. **Derived Features**: Combine existing features to create more predictive variables
6. **Domain Knowledge**: Use your understanding of the problem to create meaningful features

## Best Practices

- **Start Simple**: Begin with basic transformations before complex ones
- **Validate Impact**: Always check if new features improve model performance
- **Avoid Data Leakage**: Ensure features don't use future information
- **Document Everything**: Keep track of how features were created
- **Test on Holdout**: Validate feature engineering on unseen data
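
The leakage and holdout points can be made concrete for categorical encoding: the set of dummy columns should be decided from the training data only, and unseen data should then be aligned to it. The following is a minimal pandas sketch of that idea; the train/holdout split and the "Audiobook" category are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Product": ["Book", "Ebook", "Book"]})
# "Audiobook" is a hypothetical category that never appears in the training data
holdout = pd.DataFrame({"Product": ["Ebook", "Audiobook"]})

# Learn the feature columns from the training data only
train_X = pd.get_dummies(train["Product"], prefix="Product", dtype=int)

# Encode the holdout data, then align it to the training columns:
# unseen categories are dropped, missing columns are filled with 0
holdout_X = (
    pd.get_dummies(holdout["Product"], prefix="Product", dtype=int)
    .reindex(columns=train_X.columns, fill_value=0)
)

print(train_X)
print(holdout_X)
```

With this alignment, a model fitted on `train_X` always receives the same columns at prediction time, and a row with an unseen category is encoded as all zeros.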

## Next Steps
- Practice with your own datasets
- Experiment with different feature combinations
- Learn about feature selection techniques
- Move on to Lesson 16: Data Quality
