# Feature Engineering

Feature engineering is the process of creating new variables and transforming existing ones to improve the performance of machine learning models. It relies on domain knowledge and creativity to extract useful information from raw data. In this notebook, we will explore the following concepts:

- Creating new variables
- Transforming existing variables

## 1. Creating New Variables

Creating new variables involves generating additional features that can provide more information to the model. These new variables are derived from the existing ones and can help improve model accuracy and interpretability. Here are several examples of creating new variables in different domains:

### Example 1: Finance

- **Existing Variables**: Transaction amount, transaction date
- **New Variables**: 
  - Monthly transaction count
  - Average transaction amount per month
  - Transaction amount deviation

### Example 2: Healthcare

- **Existing Variables**: Patient age, weight, height
- **New Variables**: 
  - Body Mass Index (BMI)
  - Age group (e.g., child, adult, senior)
  - Weight change over time
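
A minimal sketch of the healthcare features, assuming longitudinal data with one row per patient visit. The `visits` table, the `patient_id` and `visit_date` columns, and all values are hypothetical and only serve to illustrate "weight change over time":

```python
import pandas as pd

# Hypothetical longitudinal records: one row per patient visit
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(
        ["2021-01-10", "2021-03-10", "2021-06-10", "2021-02-01", "2021-05-01"]
    ),
    "weight": [70.0, 72.5, 71.0, 90.0, 86.0],  # assumed to be kilograms
    "height": [1.75, 1.75, 1.75, 1.60, 1.60],  # assumed to be meters
})

# Sort so that differences are taken in chronological order within each patient
visits = visits.sort_values(["patient_id", "visit_date"]).reset_index(drop=True)

# BMI = weight / height squared
visits["BMI"] = visits["weight"] / visits["height"] ** 2

# Change since the previous visit; each patient's first visit has no predecessor (NaN)
visits["weight_change"] = visits.groupby("patient_id")["weight"].diff()

print(visits)
```

Grouping by patient before calling `diff()` keeps the first visit of one patient from being compared with the last visit of another.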

### Example 3: E-commerce

- **Existing Variables**: Product price, product category, user rating
- **New Variables**: 
  - Price after discount
  - Popularity score (based on user rating and number of reviews)
  - Seasonal sales trend
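
A minimal sketch of the first two e-commerce features. The `num_reviews` and `discount` columns are assumptions, and weighting the rating by `log(1 + num_reviews)` is only one possible way to combine rating and review count, not a standard definition:

```python
import numpy as np
import pandas as pd

# Hypothetical catalog; `num_reviews` and `discount` are assumed columns
products = pd.DataFrame({
    "product_price": [20.0, 30.0, 25.0],
    "discount": [0.10, 0.20, 0.00],   # fractional discount
    "user_rating": [5.0, 4.0, 4.5],
    "num_reviews": [1, 500, 50],
})

products["price_after_discount"] = products["product_price"] * (1 - products["discount"])

# One possible popularity score: rating weighted by log(1 + review count),
# so a single 5-star review does not outrank a well-reviewed product
products["popularity_score"] = products["user_rating"] * np.log1p(products["num_reviews"])

print(products.sort_values("popularity_score", ascending=False))
```

The logarithm dampens the effect of very large review counts. Other weightings may suit a given catalog better.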

## 2. Transforming Existing Variables

Transforming existing variables involves modifying the original features to better represent the underlying data. This can include scaling, encoding, and aggregating data. Here are some common transformations:

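- **Scaling**: Normalizing numerical variables (rescaling to a fixed range such as 0 to 1) or standardizing them (rescaling to zero mean and unit variance) so that they share a consistent scale.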
- **Encoding**: Converting categorical variables into numerical format using techniques such as one-hot encoding or label encoding.
- **Aggregating**: Summarizing data at a higher level, such as calculating the total sales per month from daily sales data.
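
The scaling and encoding variants listed above can be sketched with plain pandas. The `df_demo` frame and its values are hypothetical; the formulas are the usual min-max and z-score definitions:

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "price": [15.0, 20.0, 50.0],
    "category": ["B", "A", "A"],
})

price = df_demo["price"]

# Min-max normalization: rescale to the range [0, 1]
df_demo["price_minmax"] = (price - price.min()) / (price.max() - price.min())

# Standardization (z-score): zero mean, unit variance (population std, ddof=0)
df_demo["price_zscore"] = (price - price.mean()) / price.std(ddof=0)

# Label encoding: one integer code per category (codes follow sorted category order)
df_demo["category_code"] = df_demo["category"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(df_demo["category"], prefix="category", dtype=int)
df_demo = pd.concat([df_demo, one_hot], axis=1)

print(df_demo)
```

Label encoding implies an ordering between categories, so one-hot encoding is often the safer choice for nominal variables used in linear or distance-based models.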

## Importance of Domain Expertise

Feature engineering often requires domain knowledge to ensure that the new and transformed variables are meaningful and useful. Collaboration with domain experts can provide valuable insights and help in creating features that are coherent and relevant to the specific application.

In the following sections, we will implement these concepts using Python.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Example dataset
data = {
    'transaction_amount': [100, 200, 150, 300, 250, 50, 400],
    # Month-end dates; pandas >= 2.2 deprecates the 'M' alias here in favor of 'ME'
    'transaction_date': pd.date_range(start='2021-01-01', periods=7, freq='M'),
    'patient_age': [25, 45, 30, 35, 60, 50, 40],
    'weight': [70, 80, 75, 85, 90, 95, 65],
    'height': [1.75, 1.80, 1.65, 1.70, 1.60, 1.85, 1.90],
    'product_price': [20, 30, 25, 40, 35, 15, 50],
    'product_category': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
    'user_rating': [4, 5, 3, 4, 5, 2, 5]
}

# Convert to DataFrame
df = pd.DataFrame(data)

In [2]:
# Display original dataset
print("Original Dataset:")
print(df)

Original Dataset:
   transaction_amount transaction_date  patient_age  weight  height  \
0                 100       2021-01-31           25      70    1.75   
1                 200       2021-02-28           45      80    1.80   
2                 150       2021-03-31           30      75    1.65   
3                 300       2021-04-30           35      85    1.70   
4                 250       2021-05-31           60      90    1.60   
5                  50       2021-06-30           50      95    1.85   
6                 400       2021-07-31           40      65    1.90   

   product_price product_category  user_rating  
0             20                A            4  
1             30                B            5  
2             25                A            3  
3             40                B            4  
4             35                A            5  
5             15                B            2  
6             50                A            5  


In [3]:
# Finance: monthly transaction count, average amount, and amount deviation (standard deviation)
# These are separate per-month tables; they are not added as columns to df
df['transaction_month'] = df['transaction_date'].dt.to_period('M')
monthly_groups = df.groupby('transaction_month')['transaction_amount']
monthly_transaction_count = monthly_groups.count().reset_index(name='monthly_transaction_count')
average_transaction_amount = monthly_groups.mean().reset_index(name='average_transaction_amount')
# The sample standard deviation is NaN for a month with a single transaction,
# which is the case for every month in this example dataset
transaction_amount_deviation = monthly_groups.std().reset_index(name='transaction_amount_deviation')
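
To use monthly statistics as row-level features, they have to be attached back to each transaction. A minimal sketch using `groupby(...).transform`, on a hypothetical `tx` table with several transactions per month. The column names `monthly_amount_std` and `deviation_from_monthly_mean` are illustrative:

```python
import pandas as pd

# Hypothetical transactions: two in January, three in February, one in March
tx = pd.DataFrame({
    "transaction_date": pd.to_datetime([
        "2021-01-05", "2021-01-20",
        "2021-02-03", "2021-02-14", "2021-02-27",
        "2021-03-09",
    ]),
    "transaction_amount": [100.0, 300.0, 50.0, 150.0, 250.0, 400.0],
})

tx["transaction_month"] = tx["transaction_date"].dt.to_period("M")
by_month = tx.groupby("transaction_month")["transaction_amount"]

# transform() returns one value per original row, aligned with tx's index
tx["monthly_transaction_count"] = by_month.transform("count")
tx["average_transaction_amount"] = by_month.transform("mean")
tx["monthly_amount_std"] = by_month.transform("std")  # NaN for single-transaction months

# Row-level deviation of each transaction from its month's average
tx["deviation_from_monthly_mean"] = (
    tx["transaction_amount"] - tx["average_transaction_amount"]
)

print(tx)
```

Unlike `reset_index`, which yields one row per month, `transform` keeps one row per transaction, so the result can be assigned directly as new columns.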

In [4]:
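# Healthcare: Body Mass Index (BMI) and age group
# BMI = weight / height^2, assuming weight in kilograms and height in meters
df['BMI'] = df['weight'] / (df['height'] ** 2)
# Bins are right-inclusive by default: (0, 18], (18, 35], (35, 55], (55, 100]
df['age_group'] = pd.cut(df['patient_age'], bins=[0, 18, 35, 55, 100], labels=['child', 'adult', 'middle_age', 'senior'])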
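
It is worth checking how `pd.cut` treats values that fall exactly on a bin edge. A small sketch with hypothetical ages chosen around the edges, using the same bins and labels as the cell above:

```python
import pandas as pd

bins = [0, 18, 35, 55, 100]
labels = ["child", "adult", "middle_age", "senior"]

# Hypothetical ages placed on and just past each bin edge
edge_ages = pd.Series([18, 19, 35, 36, 55, 56])
edge_groups = pd.cut(edge_ages, bins=bins, labels=labels)
print(pd.DataFrame({"age": edge_ages, "age_group": edge_groups}))

# Values outside (0, 100] are not assigned to any bin and become NaN
outside = pd.cut(pd.Series([0, 101]), bins=bins, labels=labels)
print(outside)
```

With the default `right=True`, an age of exactly 35 is labeled `adult` and an age of 18 is labeled `child`. Passing `right=False` would make the intervals left-inclusive instead.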

In [5]:
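# E-commerce: Price after discount and popularity score
df['discount'] = [0.1, 0.2, 0.15, 0.25, 0.1, 0.3, 0.2]  # fractional discount per product
df['price_after_discount'] = df['product_price'] * (1 - df['discount'])
# Placeholder score: this dataset has no review count, so the rating is weighted by price here
df['popularity_score'] = df['user_rating'] * df['product_price']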

In [6]:
# Display new variables
print("\nNew Variables:")
print(df)


New Variables:
   transaction_amount transaction_date  patient_age  weight  height  \
0                 100       2021-01-31           25      70    1.75   
1                 200       2021-02-28           45      80    1.80   
2                 150       2021-03-31           30      75    1.65   
3                 300       2021-04-30           35      85    1.70   
4                 250       2021-05-31           60      90    1.60   
5                  50       2021-06-30           50      95    1.85   
6                 400       2021-07-31           40      65    1.90   

   product_price product_category  user_rating transaction_month        BMI  \
0             20                A            4           2021-01  22.857143   
1             30                B            5           2021-02  24.691358   
2             25                A            3           2021-03  27.548209   
3             40                B            4           2021-04  29.411765   
[output truncated]

In [7]:
# Scaling numerical variables
scaler = StandardScaler()
df[['transaction_amount_scaled', 'BMI_scaled']] = scaler.fit_transform(df[['transaction_amount', 'BMI']])

In [8]:
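# Encoding categorical variables
# scikit-learn >= 1.2 uses `sparse_output`; older releases used `sparse` instead
encoder = OneHotEncoder(sparse_output=False)
encoded_categories = encoder.fit_transform(df[['product_category']])
# Reuse df's index so the concat below aligns rows explicitly
encoded_category_df = pd.DataFrame(encoded_categories, columns=encoder.get_feature_names_out(['product_category']), index=df.index)
df = pd.concat([df, encoded_category_df], axis=1)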

In [9]:
# Aggregating data: Total sales per month
total_sales_per_month = df.groupby('transaction_month')['transaction_amount'].sum().reset_index(name='total_sales')
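
Because the example dataset has only one transaction per month, the aggregation above leaves the values unchanged. A minimal sketch on hypothetical daily data, where the `daily` frame and its column names are assumptions, shows the monthly roll-up with named aggregation:

```python
import pandas as pd

# Hypothetical daily sales spanning a month boundary: Jan 30, Jan 31, Feb 1, Feb 2
daily = pd.DataFrame({
    "date": pd.date_range("2021-01-30", periods=4, freq="D"),
    "sales": [10, 20, 30, 40],
})

daily["month"] = daily["date"].dt.to_period("M")

# Named aggregation: several summaries per month in one pass
monthly = (
    daily.groupby("month")
    .agg(total_sales=("sales", "sum"), n_days=("sales", "count"))
    .reset_index()
)

print(monthly)
```

Keeping a count next to the total makes it easier to spot months with incomplete data.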

In [10]:
# Display transformed variables
print("\nTransformed Variables:")
print(df)


Transformed Variables:
   transaction_amount transaction_date  patient_age  weight  height  \
0                 100       2021-01-31           25      70    1.75   
1                 200       2021-02-28           45      80    1.80   
2                 150       2021-03-31           30      75    1.65   
3                 300       2021-04-30           35      85    1.70   
4                 250       2021-05-31           60      90    1.60   
5                  50       2021-06-30           50      95    1.85   
6                 400       2021-07-31           40      65    1.90   

   product_price product_category  user_rating transaction_month        BMI  \
0             20                A            4           2021-01  22.857143   
1             30                B            5           2021-02  24.691358   
2             25                A            3           2021-03  27.548209   
3             40                B            4           2021-04  29.411765   
[output truncated]

In [11]:
# Display aggregated data
print("\nAggregated Data:")
print(total_sales_per_month)


Aggregated Data:
  transaction_month  total_sales
0           2021-01          100
1           2021-02          200
2           2021-03          150
3           2021-04          300
4           2021-05          250
5           2021-06           50
6           2021-07          400
