# 📌 Day 5: Feature Engineering in Machine Learning  

## Why is Feature Engineering Important?  
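Feature engineering transforms raw data into inputs a model can actually learn from: filling gaps, converting categories to numbers, putting features on comparable scales, and deriving new variables. Many scikit-learn estimators reject missing values or string categories outright, and distance- or gradient-based models are sensitive to feature scale.  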

## Key Techniques Covered  
✅ Handling Missing Values (`SimpleImputer`)  
✅ Encoding Categorical Variables (`OneHotEncoder`)  
✅ Scaling & Normalization (`StandardScaler`)  
✅ Feature Selection (`SelectKBest`)  
✅ Creating New Features (`pd.cut()`)


## Key Techniques of Feature Engineering

### 1️⃣ Handling Missing Data
🔸 Fill missing values using mean, median, or mode

🔸 Use SimpleImputer() from sklearn.impute
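The fill strategies above can be compared side by side. This is a minimal sketch on hypothetical toy data (the `toy` frame and its missing gender label are assumptions for illustration); `statistics_` exposes the value each fitted imputer would fill in.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two missing ages and one missing gender label.
toy = pd.DataFrame({
    "age": [25, np.nan, 35, 45, 18, np.nan, 60],
    "gender": pd.Series(
        ["Male", "Female", "Female", np.nan, "Female", "Male", "Female"],
        dtype=object,
    ),
})

# Numeric column: compare the value each strategy would fill in.
mean_fill = SimpleImputer(strategy="mean").fit(toy[["age"]]).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(toy[["age"]]).statistics_[0]

# Categorical column: the mode (most frequent value) is the usual choice.
mode_imputer = SimpleImputer(strategy="most_frequent")
gender_filled = mode_imputer.fit_transform(toy[["gender"]])
mode_fill = mode_imputer.statistics_[0]

print("mean:", mean_fill, "median:", median_fill, "mode:", mode_fill)
```

The median is less affected by extreme values than the mean, which is one reason it is a common default for skewed numeric columns.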

### 2️⃣ Encoding Categorical Variables
🔸 Convert text labels to numbers using One-Hot Encoding (OHE) or Label Encoding

### 3️⃣ Scaling & Normalization
🔸 Put features on a comparable scale so that large-valued columns (such as salary) do not dominate small-valued ones (such as age)

🔸 Use StandardScaler() or MinMaxScaler()
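The two scalers rescale differently: `StandardScaler` centers each column to mean 0 and unit standard deviation, while `MinMaxScaler` maps each column to the range [0, 1] by default. A brief sketch, assuming the salary values used later in this notebook:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Salary values as a single-column 2D array (scikit-learn expects 2D input).
salary = np.array(
    [[50000], [60000], [55000], [70000], [45000], [75000], [80000]], dtype=float
)

standardized = StandardScaler().fit_transform(salary)  # (x - mean) / std
rescaled = MinMaxScaler().fit_transform(salary)        # (x - min) / (max - min)

print(standardized.ravel())
print(rescaled.ravel())
```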

### 4️⃣ Feature Selection
🔸 Remove low variance features

🔸 Use correlation matrix or SelectKBest()
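Low-variance filtering and a correlation check can be sketched as follows. The constant `country` column is a hypothetical addition to give `VarianceThreshold` something to remove.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "age": [25, 35, 35, 45, 18, 35, 60],
    "salary": [50000, 60000, 55000, 70000, 45000, 75000, 80000],
    "country": [1, 1, 1, 1, 1, 1, 1],  # hypothetical constant column
})

# Drop features whose variance is not above the threshold (0.0 removes constants).
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)
kept = list(toy.columns[selector.get_support()])

# Pairwise Pearson correlations among the remaining features.
corr = toy[kept].corr()

print(kept)
print(corr)
```

A pair of features with a correlation close to 1 or -1 carries largely redundant information, so one of them is a candidate for removal.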

### 5️⃣ Creating New Features
🔸 Combine existing columns to create new insights

🔸 Example: Creating Age Groups
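Combining columns can be as simple as taking a ratio. A minimal sketch; the feature name `salary_per_year_of_age` and the three rows are purely illustrative assumptions, not part of the dataset below.

```python
import pandas as pd

# Hypothetical rows with the same column names used in this notebook.
toy = pd.DataFrame({"age": [25, 35, 45], "salary": [50000, 60000, 70000]})

# Derived feature: salary relative to age (illustrative only).
toy["salary_per_year_of_age"] = toy["salary"] / toy["age"]

print(toy)
```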

In [1]:
# Importing Libraries

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

In [2]:
# Sample dataset
data = {'age': [25, np.nan, 35, 45, 18, np.nan, 60], 
        'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'salary': [50000, 60000, 55000, 70000, 45000, 75000, 80000]}

df = pd.DataFrame(data)

In [3]:
df

    age  gender  salary
0  25.0    Male   50000
1   NaN  Female   60000
2  35.0  Female   55000
3  45.0    Male   70000
4  18.0  Female   45000
5   NaN    Male   75000
6  60.0  Female   80000


# 1️⃣ Handling Missing Values

In [4]:
# Replace missing ages with the median of the observed ages
imputer = SimpleImputer(strategy='median')
df[['age']] = imputer.fit_transform(df[['age']])

In [5]:
print(df)

    age  gender  salary
0  25.0    Male   50000
1  35.0  Female   60000
2  35.0  Female   55000
3  45.0    Male   70000
4  18.0  Female   45000
5  35.0    Male   75000
6  60.0  Female   80000


# 2️⃣ Encoding Categorical Variables

In [6]:
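# drop='first' removes the first category in sorted order ('Female'),
# so a single indicator column for 'Male' remains
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_gender = encoder.fit_transform(df[['gender']])

df['Male'] = encoded_gender[:, 0]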

In [18]:
print(df)

    age  gender  salary  Male
0  25.0    Male   50000   1.0
1  35.0  Female   60000   0.0
2  35.0  Female   55000   0.0
3  45.0    Male   70000   1.0
4  18.0  Female   45000   0.0
5  35.0    Male   75000   1.0
6  60.0  Female   80000   0.0


# 3️⃣ Scaling & Normalization

In [19]:
# Standardize each column to mean 0 and unit standard deviation
scaler = StandardScaler()

df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

In [20]:
print(df)

        age  gender    salary  Male
0 -0.886936    Male -0.994850   1.0
1 -0.090968  Female -0.175562   0.0
2 -0.090968  Female -0.585206   0.0
3  0.705001    Male  0.643726   1.0
4 -1.444115  Female -1.404494   0.0
5 -0.090968    Male  1.053370   1.0
6  1.898954  Female  1.463014   0.0
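The scaled values can be reproduced by hand with z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). A quick check on the imputed ages:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages after median imputation, as in the table from step 1.
age = np.array([25, 35, 35, 45, 18, 35, 60], dtype=float)

# np.std defaults to ddof=0, which matches StandardScaler.
manual = (age - age.mean()) / age.std()
from_sklearn = StandardScaler().fit_transform(age.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, from_sklearn))
```

Note that `pandas.Series.std()` defaults to `ddof=1`, so computing the same formula with pandas defaults gives slightly different numbers.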


# 4️⃣ Feature Selection

In [21]:
X = df[['age', 'salary', 'Male']]
y = [0, 1, 0, 1, 0, 1, 0]  # Example target variable

# Keep the k=2 features with the highest ANOVA F-scores with respect to y
X_new = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)


In [23]:
print(X,y)
print(X_new)

        age    salary  Male
0 -0.886936 -0.994850   1.0
1 -0.090968 -0.175562   0.0
2 -0.090968 -0.585206   0.0
3  0.705001  0.643726   1.0
4 -1.444115 -1.404494   0.0
5 -0.090968  1.053370   1.0
6  1.898954  1.463014   0.0 [0, 1, 0, 1, 0, 1, 0]
[[-0.99484975  1.        ]
 [-0.17556172  0.        ]
 [-0.58520574  0.        ]
 [ 0.64372631  1.        ]
 [-1.40449377  0.        ]
 [ 1.05337032  1.        ]
 [ 1.46301434  0.        ]]


# 5️⃣ Creating New Features

In [24]:
# 'age' was standardized in step 3, so these bin edges are z-scores, not years.
# Open-ended outer bins keep the lowest and highest values from falling outside every bin (NaN).
df['age_group'] = pd.cut(df['age'], bins=[-np.inf, -0.5, 0.5, np.inf], labels=['Young', 'Adult', 'Senior'])

In [25]:
print(df)

        age  gender    salary  Male age_group
0 -0.886936    Male -0.994850   1.0     Young
1 -0.090968  Female -0.175562   0.0     Adult
2 -0.090968  Female -0.585206   0.0     Adult
3  0.705001    Male  0.643726   1.0    Senior
4 -1.444115  Female -1.404494   0.0     Young
5 -0.090968    Male  1.053370   1.0     Adult
6  1.898954  Female  1.463014   0.0    Senior
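Bin edges are usually easier to read and justify in the original units. A sketch that bins the raw ages (before any scaling); the cut points 30 and 50 are hypothetical choices for illustration.

```python
import pandas as pd

# Ages in years, after imputation but before scaling.
ages = pd.Series([25, 35, 35, 45, 18, 35, 60], name="age")

# Intervals are right-closed by default: (0, 30], (30, 50], (50, 120].
age_group = pd.cut(ages, bins=[0, 30, 50, 120], labels=["Young", "Adult", "Senior"])

print(pd.concat([ages, age_group.rename("age_group")], axis=1))
```

Any value outside the outermost edges would come back as `NaN`, so the bins should cover the full range of the data.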
