# Feature Engineering


What is Feature Engineering?
Feature engineering involves transforming raw data into features that better represent the underlying patterns. It often requires domain knowledge and creativity to identify meaningful features that can enhance the predictive power of a model.

Why is Feature Engineering Important?

Raw data might not have features in the right format or scale for machine learning algorithms.
Well-chosen features can help models capture complex relationships and improve accuracy, and dropping irrelevant features can reduce overfitting.
It allows the model to focus on relevant information, leading to better generalization.

Techniques in Feature Engineering:

1. Feature Creation: Create new features based on domain knowledge. Example: Extracting the month or day of the week from a timestamp.
2. Binning or Bucketing: Group continuous values into bins to turn numerical data into categorical features. Example: Converting age into age groups (0-18, 19-35, etc.).
3. One-Hot Encoding: Convert categorical variables into binary values (0 or 1) for each category. Example: Encoding "gender" as two binary variables: is_male and is_female.
4. Encoding Ordinal Variables: Convert ordinal variables (categories with an inherent order) into numerical values. Example: Converting "education level" from categorical to numerical values.
5. Text and NLP Features: Extract features from text data, such as word frequency, TF-IDF, or sentiment scores. Example: Extracting keywords from product reviews.
6. Aggregation and Grouping: Create features by aggregating data across groups. Example: Calculating average purchase amount for each customer.
7. Interaction Features: Combine two or more features to create new interactions. Example: Multiplying "number of products purchased" and "average product price."
8. Polynomial Features: Generate higher-order features by raising existing features to a power. Example: Adding squared or cubed versions of a numerical feature.
9. Time-Based Features: Extract features like day of the week, hour of the day, or time since a specific event. Example: Calculating days until the next holiday.

Feature Engineering Process:

1. Data Understanding: Understand the data, its context, and the relevant domain knowledge.
2. Feature Generation: Brainstorm and create new features based on domain knowledge and intuition.
3. Feature Selection: Evaluate the importance of each feature and select the most relevant ones.
4. Feature Transformation: Apply scaling or normalization to put features on a similar scale, or a log transformation to reduce skew.
5. Model Building and Validation: Train models using the engineered features and validate their performance.
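
The feature transformation step can be sketched with plain pandas and NumPy. The `income` column and its values are made up for illustration, and the statistics are computed on the whole column for brevity; in a real workflow they would normally be computed on the training split only and then reused on new data.

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily skewed column (illustrative values only)
data = pd.DataFrame({'income': [20_000, 35_000, 50_000, 1_000_000]})
col = data['income']

# Standardization: zero mean, unit (population) standard deviation
data['income_standardized'] = (col - col.mean()) / col.std(ddof=0)

# Min-max normalization: rescale to the [0, 1] range
data['income_minmax'] = (col - col.min()) / (col.max() - col.min())

# Log transformation: compress the long right tail; log1p computes log(1 + x)
data['income_log'] = np.log1p(col)

print(data)
```

Standardization and min-max scaling change the scale of a feature but not the shape of its distribution, whereas the log transformation changes the shape by pulling in large values.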

Benefits of Effective Feature Engineering:

Enhanced model performance and accuracy.
Improved generalization to unseen data.
Reduction of noise, and of data dimensionality when feature selection removes redundant or irrelevant features.

Remember that feature engineering is an iterative process that requires experimentation and a deep understanding of the data and the problem you're trying to solve. It is a crucial step in getting the most out of machine learning algorithms, whether they are making predictions or classifications.
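
As one possible way to make the feature selection step concrete, the sketch below scores features with a univariate F-test using scikit-learn's `SelectKBest`. The data is synthetic and the column names (`signal`, `noise_a`, `noise_b`) are invented for the example; other scoring functions or model-based importances may suit a real problem better.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only 'signal' actually drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise_a': rng.normal(size=200),
    'noise_b': rng.normal(size=200),
})
y = 3 * X['signal'] + rng.normal(scale=0.1, size=200)

# Keep the single feature with the highest F-score against the target
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

selected = X.columns[selector.get_support()].tolist()
print(selected)
```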

1. Feature Creation:
Create new features based on domain knowledge or insights. For example, extracting the day of the week from a date could be valuable in understanding weekly patterns.

In [1]:
import pandas as pd

data = pd.DataFrame({'timestamp': ['2023-08-01', '2023-08-02', '2023-08-03']})
data['day_of_week'] = pd.to_datetime(data['timestamp']).dt.day_name()
print(data)


    timestamp day_of_week
0  2023-08-01     Tuesday
1  2023-08-02   Wednesday
2  2023-08-03    Thursday
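
The same pattern extends to other calendar-derived features. The sketch below adds a month column and a weekend flag; the dates and the column names `month` and `is_weekend` are illustrative choices, not part of the example above.

```python
import pandas as pd

# Illustrative dates: a Tuesday, a Saturday, and a Sunday
data = pd.DataFrame({'timestamp': ['2023-08-01', '2023-08-05', '2023-12-31']})
ts = pd.to_datetime(data['timestamp'])

data['month'] = ts.dt.month

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
data['is_weekend'] = ts.dt.dayofweek >= 5

print(data)
```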


2. Binning or Bucketing:
Group continuous values into bins to convert numerical data into categorical features. Useful for creating more interpretable features.

In [2]:
data = pd.DataFrame({'age': [25, 32, 47, 55, 60]})
bins = [0, 30, 40, 50, 100]
labels = ['young', 'mid-age', 'prime', 'senior']
data['age_group'] = pd.cut(data['age'], bins=bins, labels=labels)
print(data)


   age age_group
0   25     young
1   32   mid-age
2   47     prime
3   55    senior
4   60    senior
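
When no natural cut points are known, quantile-based bins are a common alternative to fixed edges. A minimal sketch using `pd.qcut` on the same ages follows; the labels `q1` to `q4` and the column name `age_quartile` are arbitrary.

```python
import pandas as pd

data = pd.DataFrame({'age': [25, 32, 47, 55, 60]})

# qcut chooses bin edges from the data so each bin holds roughly the same
# number of rows, instead of using fixed edges as pd.cut does
data['age_quartile'] = pd.qcut(data['age'], q=4, labels=['q1', 'q2', 'q3', 'q4'])

print(data)
```

Because the edges depend on the data, quantile bins learned on one dataset should be reused, not recomputed, when the feature is built for new data.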


3. One-Hot Encoding:
Convert categorical variables into binary (0 or 1) columns for each category. Useful for algorithms that require numerical input.

In [3]:
data = pd.DataFrame({'gender': ['male', 'female', 'male', 'non-binary']})
encoded_data = pd.get_dummies(data, columns=['gender'], prefix=['is'])
print(encoded_data)


   is_female  is_male  is_non-binary
0      False     True          False
1       True    False          False
2      False     True          False
3      False    False           True
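
One practical wrinkle is that new data may contain categories the training data did not, which would produce a different set of columns. The sketch below shows one way to handle this in pandas by reindexing to the training columns; the `new` frame and its `'unknown'` value are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'gender': ['male', 'female', 'male', 'non-binary']})
new = pd.DataFrame({'gender': ['female', 'unknown']})  # 'unknown' was never seen in train

train_encoded = pd.get_dummies(train, columns=['gender'], prefix='is', dtype=int)

# Align new data to the training columns: unseen categories are dropped and
# categories missing from the new data are filled with 0
new_encoded = (
    pd.get_dummies(new, columns=['gender'], prefix='is', dtype=int)
    .reindex(columns=train_encoded.columns, fill_value=0)
)

print(new_encoded)
```

The row with the unseen category ends up as all zeros, which keeps the column layout stable for a model trained on `train_encoded`.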


4. Encoding Ordinal Variables:
Convert ordinal categorical variables into numerical values that maintain the order.

In [4]:
data = pd.DataFrame({'education_level': ['high school', 'college', 'master', 'high school']})
education_mapping = {'high school': 1, 'college': 2, 'master': 3}
data['education_level_encoded'] = data['education_level'].map(education_mapping)
print(data)


  education_level  education_level_encoded
0     high school                        1
1         college                        2
2          master                        3
3     high school                        1
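
An alternative to a hand-written mapping is an ordered pandas categorical, which stores the order with the column itself. This is a sketch under the assumption that the three levels listed are the only ones present; note that the resulting codes start at 0, not 1.

```python
import pandas as pd

data = pd.DataFrame({'education_level': ['high school', 'college', 'master', 'high school']})

# Declare the order explicitly, from lowest to highest
order = ['high school', 'college', 'master']
data['education_level'] = pd.Categorical(
    data['education_level'], categories=order, ordered=True
)

# Integer codes follow the declared order (0-based)
data['education_code'] = data['education_level'].cat.codes

# Ordered categoricals also support comparisons against a level
at_least_college = data['education_level'] >= 'college'

print(data)
print(at_least_college)
```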


5. Text and NLP Features:
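Extract features from text data, like word frequency, TF-IDF (Term Frequency-Inverse Document Frequency), or sentiment scores. The example below builds a bag-of-words matrix of raw word counts, with one column per vocabulary word in alphabetical order.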

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

data = pd.DataFrame({'text': ['great product', 'disappointed with service', 'loved the experience']})
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(data['text'])
print(text_features.toarray())


[[0 0 1 0 1 0 0 0]
 [1 0 0 0 0 1 0 1]
 [0 1 0 1 0 0 1 0]]


6. Aggregation and Grouping:
Create features by aggregating data across groups. Useful for creating summary statistics for each group.

In [6]:
data = pd.DataFrame({'customer_id': [1, 2, 1, 2], 'purchase_amount': [50, 75, 100, 60]})
agg_data = data.groupby('customer_id')['purchase_amount'].agg(['mean', 'sum']).reset_index()
print(agg_data)


   customer_id  mean  sum
0            1  75.0  150
1            2  67.5  135
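
The aggregated table has one row per customer. To use a group statistic as a feature on each original row, `groupby(...).transform` broadcasts it back to the row level. The column names `customer_mean` and `diff_from_mean` are chosen for this sketch.

```python
import pandas as pd

data = pd.DataFrame({'customer_id': [1, 2, 1, 2], 'purchase_amount': [50, 75, 100, 60]})

# transform returns a result aligned with the original rows
data['customer_mean'] = data.groupby('customer_id')['purchase_amount'].transform('mean')

# How far each purchase is from that customer's typical purchase
data['diff_from_mean'] = data['purchase_amount'] - data['customer_mean']

print(data)
```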


7. Interaction Features:
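Create new features by combining existing features, for example by multiplying two numerical features or taking a ratio of them. The example below computes body mass index (BMI) as weight in kilograms divided by the square of height in meters.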

In [7]:
data = pd.DataFrame({'height': [165, 175, 160], 'weight': [55, 70, 50]})
data['bmi'] = data['weight'] / (data['height'] / 100) ** 2
print(data)


   height  weight        bmi
0     165      55  20.202020
1     175      70  22.857143
2     160      50  19.531250
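
A multiplicative interaction can be sketched in the same way. The column names and values below are hypothetical, chosen to match the "number of products purchased" times "average product price" idea.

```python
import pandas as pd

# Hypothetical order data
data = pd.DataFrame({'num_products': [2, 5, 1], 'avg_price': [10.0, 4.0, 25.0]})

# The product captures total spend, which neither column shows on its own
data['total_spend'] = data['num_products'] * data['avg_price']

print(data)
```

The first two rows differ in both inputs yet have the same total spend, which is the kind of relationship a model may struggle to learn from the raw columns alone.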


8. Polynomial Features:
Generate higher-order features by raising existing features to a power. Useful for capturing nonlinear relationships. In the output below, the three columns are the constant term 1, x, and x squared.

In [8]:
from sklearn.preprocessing import PolynomialFeatures

data = pd.DataFrame({'x': [2, 3, 4]})
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(data)
print(poly_features)


[[ 1.  2.  4.]
 [ 1.  3.  9.]
 [ 1.  4. 16.]]


9. Time-Based Features:
Extract features like day of the week, hour of the day, or time since a specific event.

In [9]:
data = pd.DataFrame({'timestamp': ['2023-08-01 08:00:00', '2023-08-01 14:30:00']})
data['hour'] = pd.to_datetime(data['timestamp']).dt.hour
print(data)


             timestamp  hour
0  2023-08-01 08:00:00     8
1  2023-08-01 14:30:00    14
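
Features measuring time to or from an event can be built by subtracting timestamps. The sketch below counts whole days until a holiday; the holiday date and the column name `days_until_holiday` are assumptions made for the example.

```python
import pandas as pd

data = pd.DataFrame({'timestamp': ['2023-08-01 08:00:00', '2023-12-20 14:30:00']})
ts = pd.to_datetime(data['timestamp'])

# Hypothetical holiday date, for illustration only
holiday = pd.Timestamp('2023-12-25')

# normalize() resets the time of day to midnight so only whole days are counted
data['days_until_holiday'] = (holiday - ts.dt.normalize()).dt.days

print(data)
```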


These examples showcase various feature engineering techniques that can enhance the quality and predictive power of your data for machine learning tasks. Remember that the choice of technique depends on the nature of your data, the problem you're solving, and your domain expertise.