# Data Engineering – Exercises and Solutions

## Lecture 8 - Feature Engineering and Vectorization


### Exercise 1: Categorical Feature Encoding
Given the dataset below:  
```python
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
```
1. Convert the 'neighborhood' categorical feature into numerical values using **One-Hot Encoding**.  
2. Create a DataFrame that represents this transformation.  


In [None]:

import pandas as pd

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

df = pd.DataFrame(data)
encoded_df = pd.get_dummies(df, columns=['neighborhood'])
encoded_df
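
To make the result of the encoding concrete, here is a minimal self-contained sketch that rebuilds the Exercise 1 data and inspects the generated columns. The `dtype=int` argument is an addition not used in the solution above; it is only there so the indicators print as 0/1, since recent pandas versions produce booleans by default.

```python
import pandas as pd

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
df = pd.DataFrame(data)

# dtype=int is an assumption made for readability (0/1 instead of False/True)
encoded_df = pd.get_dummies(df, columns=['neighborhood'], dtype=int)

# One indicator column per distinct neighborhood, named '<column>_<value>'
print(list(encoded_df.columns))

# Each row has exactly one indicator set to 1
indicator_cols = [c for c in encoded_df.columns if c.startswith('neighborhood_')]
print(encoded_df[indicator_cols].sum(axis=1).tolist())
```

The original 'neighborhood' column is dropped and replaced by three indicator columns, one each for Fremont, Queen Anne, and Wallingford.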



### Exercise 2: Imputation of Missing Data
Expand the previous dataset by introducing missing values:  
```python
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': None},
    {'price': None, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
```  
1. Impute missing values in the 'price' column using the **mean** of the available prices.  
2. Impute missing values in the 'neighborhood' column by filling forward (using the last valid observation).  


In [None]:

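# An integer column cannot hold NaN, so cast 'price' to float before
# introducing the missing values from the exercise statement
df['price'] = df['price'].astype('float64')
df.loc[2, 'price'] = None
df.loc[1, 'neighborhood'] = None

# Impute price with the mean of the available prices (mean() skips NaN).
# Assigning the result back avoids the chained inplace fillna, which is
# deprecated in recent pandas versions.
df['price'] = df['price'].fillna(df['price'].mean())

# Forward-fill neighborhood with the last valid observation
# (fillna(method='ffill') is deprecated in favor of ffill())
df['neighborhood'] = df['neighborhood'].ffill()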

df
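
As a quick check of the imputed values, the sketch below builds the Exercise 2 dataset directly (rather than modifying the Exercise 1 DataFrame) and works through both steps. The variable name `mean_price` is introduced here purely for illustration.

```python
import pandas as pd

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': None},
    {'price': None, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
df = pd.DataFrame(data)

# Mean of the three available prices: (850000 + 700000 + 600000) / 3
mean_price = df['price'].mean()
df['price'] = df['price'].fillna(mean_price)

# Row 1 takes the value of the row above it ('Queen Anne')
df['neighborhood'] = df['neighborhood'].ffill()

print(round(mean_price, 2))
print(df.loc[1, 'neighborhood'])
```

Forward filling can only copy from an earlier row, so a missing value in the very first row would remain missing after `ffill()`.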



### Exercise 3: Feature Scaling
For the imputed dataset:  
1. Scale the 'price' and 'rooms' columns using **Min-Max Scaling**.  
2. Create a new DataFrame representing the scaled features.  


In [None]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df[['price', 'rooms']])
scaled_df = pd.DataFrame(scaled_data, columns=['scaled_price', 'scaled_rooms'])

scaled_df
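
With its default `feature_range` of (0, 1), `MinMaxScaler` maps each column through x_scaled = (x - min) / (max - min). The sketch below applies that formula by hand with plain pandas, assuming the imputed values from Exercise 2 (the missing price replaced by the mean of the other three). The name `manual_scaled` is hypothetical.

```python
import pandas as pd

# Imputed values from Exercise 2; the third price is the mean of the others
df = pd.DataFrame({
    'price': [850000.0, 700000.0, 2150000 / 3, 600000.0],
    'rooms': [4, 3, 3, 2],
})

features = df[['price', 'rooms']]

# x_scaled = (x - min) / (max - min), computed column by column
manual_scaled = (features - features.min()) / (features.max() - features.min())
manual_scaled.columns = ['scaled_price', 'scaled_rooms']

print(manual_scaled)
```

The smallest value in each column maps to 0 and the largest to 1, so the scaled 'rooms' column becomes 1.0, 0.5, 0.5, 0.0.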



### Exercise 4: Derived Features  
1. Create a new feature called 'price_per_room' by dividing 'price' by 'rooms'.  
2. Add this feature to the DataFrame and display the updated DataFrame.


In [None]:

df['price_per_room'] = df['price'] / df['rooms']
df



### Exercise 5: Binning  
1. Bin the 'price' column into 3 categories: Low, Medium, High.  
2. Use the pandas `cut` function (`pd.cut`) to assign the labels.  
3. Display the resulting DataFrame with the new 'price_category' column.  


In [None]:

# An open-ended upper edge keeps the bin edges strictly increasing
# even if no price exceeds 750000
bins = [0, 650000, 750000, float('inf')]
labels = ['Low', 'Medium', 'High']

df['price_category'] = pd.cut(df['price'], bins=bins, labels=labels)
df
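
By default `pd.cut` builds right-inclusive intervals, so a value equal to a bin edge falls into the lower bin. The following sketch illustrates this on a small hypothetical price series (not the imputed dataset), using an open-ended last bin as one possible way to capture all high prices.

```python
import pandas as pd

# Hypothetical prices chosen so that one value sits exactly on a bin edge
prices = pd.Series([600000, 650000, 700000, 850000])

# Intervals: (0, 650000], (650000, 750000], (750000, inf]
bins = [0, 650000, 750000, float('inf')]
labels = ['Low', 'Medium', 'High']

categories = pd.cut(prices, bins=bins, labels=labels)
print(categories.tolist())
```

The price 650000 is labeled 'Low' because it sits on the closed right edge of the first interval. Passing `right=False` would make the intervals left-inclusive instead.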



### Exercise 6: Polynomial Features  
1. Generate polynomial features up to the 2nd degree for 'rooms' and 'price'.  
2. Use `PolynomialFeatures` from `sklearn.preprocessing`.  
3. Display the transformed DataFrame.


In [None]:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['rooms', 'price']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['rooms', 'price']))

poly_df
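
For two inputs and `degree=2` without a bias column, the transformation yields five features: the two originals, their squares, and their product. The sketch below builds the same columns by hand in plain pandas, assuming the original Exercise 1 values for 'rooms' and 'price'. The column names are chosen to mirror those returned by `get_feature_names_out`, and `manual_poly` is a hypothetical name.

```python
import pandas as pd

# Original Exercise 1 values (no missing data), used here as an assumption
df = pd.DataFrame({
    'rooms': [4, 3, 3, 2],
    'price': [850000, 700000, 650000, 600000],
})

# Degree-2 expansion written out term by term
manual_poly = pd.DataFrame({
    'rooms': df['rooms'],
    'price': df['price'],
    'rooms^2': df['rooms'] ** 2,
    'rooms price': df['rooms'] * df['price'],
    'price^2': df['price'] ** 2,
})

print(manual_poly)
```

Because 'price' is several orders of magnitude larger than 'rooms', the squared and interaction terms grow very large, which is one reason scaling is often applied before generating polynomial features.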
