## Feature Engineering:

- Feature Engineering is the process of converting raw data into meaningful numerical features so that a machine learning model can understand patterns and make better predictions.

### 1. Feature Exploration
- The process of analyzing and visualizing features to uncover patterns, distributions, correlations, and potential issues in the dataset before applying machine learning.
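A first exploration pass can be sketched with plain pandas. The snippet below is a minimal illustration, assuming the same small student dataset used later in this notebook; the variable names `summary`, `missing`, and `correlations` are chosen here for illustration only.

```python
import pandas as pd

# Toy student dataset (same values as the notebook's example data)
df = pd.DataFrame({
    "Study_Hours": [2, 4, 6, 8, 10],
    "Attendance": [60, 70, 80, 90, 95],
    "Maths": [40, 55, 65, 75, 80],
    "Science": [42, 58, 68, 78, 88],
})

summary = df.describe()       # count, mean, std, min, quartiles, max per column
missing = df.isna().sum()     # number of missing values per column
correlations = df.corr()      # pairwise Pearson correlations

print(summary)
print(missing)
print(correlations.round(3))
```

Summary statistics reveal ranges and outliers, the missing-value count flags columns that need cleaning, and the correlation matrix hints at which features move together.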

### 2. Feature Creation
- The process of generating new meaningful features from existing data using domain knowledge or mathematical operations to help the model learn better patterns.

### 3. Feature Encoding
- The process of converting categorical or non-numeric features into numeric form so machine learning models can process them effectively.
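One common way to do this is one-hot encoding. The sketch below uses `pd.get_dummies`; the `Section` column is a hypothetical categorical feature invented for this example and is not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical data: "Section" is an illustrative categorical column
students = pd.DataFrame({
    "Study_Hours": [2, 4, 6],
    "Section": ["A", "B", "A"],
})

# One-hot encoding: one 0/1 indicator column per category value
encoded = pd.get_dummies(students, columns=["Section"], dtype=int)
print(encoded)
```

Each category becomes its own indicator column (`Section_A`, `Section_B`), so the model never treats the labels as ordered numbers.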

### 4. Feature Scaling
- The process of transforming numerical features to a standard range or distribution so that no feature dominates the learning process due to its scale.
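Two widely used scaling schemes are min-max scaling and standardization. The following is a minimal pandas-only sketch, assuming two of the notebook's numeric columns; the names `min_max` and `standardized` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Study_Hours": [2, 4, 6, 8, 10],
    "Attendance": [60, 70, 80, 90, 95],
})

# Min-max scaling: rescale each column to the range [0, 1]
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): each column gets mean 0 and standard deviation 1
# ddof=0 uses the population standard deviation
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized.round(3))
```

After either transform, `Attendance` (values up to 95) no longer outweighs `Study_Hours` (values up to 10) purely because of its larger units.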

### 5. Feature Selection
- The process of identifying and keeping only the most relevant features while removing irrelevant or redundant ones to improve model performance and reduce overfitting.

## Feature Creation (Adds Intelligence):
### Definition:
- Feature Creation is the process of creating new features from existing data using domain knowledge and mathematical operations to help the model learn better patterns.
### Why it "Adds Intelligence":
- Raw data may not show hidden relationships
- New features provide extra information
- Models cannot reason about the data on their own; well-chosen features guide learning

### Common Types of Feature Creation:
- Domain-based features (real-world logic)
- Mathematical combinations (sum, average, ratio)
- Interaction features (combined effect of features)
- Polynomial features (non-linear relationships)

### Example:
- Total_Marks = Maths + Science
- Efficiency = Total_Marks / Study_Hours
### Impact on Model:
- Can improve accuracy when the new features carry real signal
- Helps models capture complex patterns
- Makes simple models perform better

# Feature Creation

In [14]:
import pandas as pd

# Small example dataset: one row per student
df = pd.DataFrame({
    "Study_Hours": [2, 4, 6, 8, 10],
    "Attendance": [60, 70, 80, 90, 95],
    "Maths": [40, 55, 65, 75, 80],
    "Science": [42, 58, 68, 78, 88],
})

In [16]:
df

| | Study_Hours | Attendance | Maths | Science |
|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 42 |
| 1 | 4 | 70 | 55 | 58 |
| 2 | 6 | 80 | 65 | 68 |
| 3 | 8 | 90 | 75 | 78 |
| 4 | 10 | 95 | 80 | 88 |


# Domain-Based Feature

In [19]:
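# Domain knowledge: the combined score across both subjects
df["Total_Marks"] = df["Maths"] + df["Science"]
df

| | Study_Hours | Attendance | Maths | Science | Total_Marks |
|---|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 42 | 82 |
| 1 | 4 | 70 | 55 | 58 | 113 |
| 2 | 6 | 80 | 65 | 68 | 133 |
| 3 | 8 | 90 | 75 | 78 | 153 |
| 4 | 10 | 95 | 80 | 88 | 168 |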


# Mathematical Feature

In [22]:
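# Ratio feature: marks earned per hour of study
df["Marks_per_Hours"] = df["Total_Marks"] / df["Study_Hours"]
df

| | Study_Hours | Attendance | Maths | Science | Total_Marks | Marks_per_Hours |
|---|---|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 42 | 82 | 41.0 |
| 1 | 4 | 70 | 55 | 58 | 113 | 28.25 |
| 2 | 6 | 80 | 65 | 68 | 133 | 22.166667 |
| 3 | 8 | 90 | 75 | 78 | 153 | 19.125 |
| 4 | 10 | 95 | 80 | 88 | 168 | 16.8 |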
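Ratio features like this one break down when the denominator is zero. The sketch below shows one possible guard, assuming that a missing value is an acceptable result for a student with no recorded study time; the rows here are hypothetical and exist only to trigger that case.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first student has zero study hours
df = pd.DataFrame({
    "Total_Marks": [50, 113, 133],
    "Study_Hours": [0, 4, 6],
})

# Replace 0 with NaN before dividing so the ratio is missing instead of infinite
safe_hours = df["Study_Hours"].replace(0, np.nan)
df["Marks_per_Hours"] = df["Total_Marks"] / safe_hours
print(df)
```

The missing ratio can then be imputed or the row dropped, whichever suits the downstream model.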


# Interaction Feature

In [25]:
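# Interaction feature: combined effect of study time and attendance
df["Study_Attendence_Interaction"] = df["Study_Hours"] * df["Attendance"]
df

| | Study_Hours | Attendance | Maths | Science | Total_Marks | Marks_per_Hours | Study_Attendence_Interaction |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 42 | 82 | 41.0 | 120 |
| 1 | 4 | 70 | 55 | 58 | 113 | 28.25 | 280 |
| 2 | 6 | 80 | 65 | 68 | 133 | 22.166667 | 480 |
| 3 | 8 | 90 | 75 | 78 | 153 | 19.125 | 720 |
| 4 | 10 | 95 | 80 | 88 | 168 | 16.8 | 950 |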


# Polynomial Feature

In [28]:
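# Polynomial feature: lets a linear model capture a non-linear effect of study time
df["Study_Hours_Squared"] = df["Study_Hours"] ** 2
df

| | Study_Hours | Attendance | Maths | Science | Total_Marks | Marks_per_Hours | Study_Attendence_Interaction | Study_Hours_Squared |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 42 | 82 | 41.0 | 120 | 4 |
| 1 | 4 | 70 | 55 | 58 | 113 | 28.25 | 280 | 16 |
| 2 | 6 | 80 | 65 | 68 | 133 | 22.166667 | 480 | 36 |
| 3 | 8 | 90 | 75 | 78 | 153 | 19.125 | 720 | 64 |
| 4 | 10 | 95 | 80 | 88 | 168 | 16.8 | 950 | 100 |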


# Feature Selection (Removes Noise):
### Definition:
- Feature Selection is the process of selecting only relevant features and removing irrelevant or redundant ones to improve performance.
### Why it "Removes Noise":
- Too many features can confuse the model
- Irrelevant features can cause overfitting
- Noise reduces prediction accuracy

### Types of Feature Selection:
- Filter methods (correlation, chi-square)
- Wrapper methods (forward selection, backward elimination, RFE)
- Embedded methods (Lasso, tree-based models)

### Example:
- Removing weakly correlated features
- Dropping features with zero importance
- Selecting top features using RFE (Recursive Feature Elimination)
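The RFE example can be sketched with scikit-learn. This is an illustration under assumptions: it uses `LinearRegression` as the estimator, keeps two features, and restricts the candidates to the three original input columns; none of these choices come from the notebook itself.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Study_Hours": [2, 4, 6, 8, 10],
    "Attendance": [60, 70, 80, 90, 95],
    "Maths": [40, 55, 65, 75, 80],
    "Science": [42, 58, 68, 78, 88],
})
X = df[["Study_Hours", "Attendance", "Maths"]]
y = df["Science"]

# RFE fits the estimator, drops the least important feature (judged by
# coefficient magnitude for linear models), and repeats until
# n_features_to_select features remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

selected = X.columns[selector.support_].tolist()  # features that were kept
ranking = dict(zip(X.columns, selector.ranking_))  # 1 = kept, higher = dropped earlier
print(selected)
print(ranking)
```

Because the ranking depends on coefficient size, scaling the features first usually makes the comparison fairer; with only five rows, the result here should be read as a demonstration of the mechanics rather than a reliable selection.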

### Impact on Model:
- Reduces overfitting
- Improves generalization
- Speeds up training

In [31]:
df

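| | Study_Hours | Attendance | Maths | Science | Total_Marks | Marks_per_Hours | Study_Attendence_Interaction | Study_Hours_Squared |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 42 | 82 | 41.0 | 120 | 4 |
| 1 | 4 | 70 | 55 | 58 | 113 | 28.25 | 280 | 16 |
| 2 | 6 | 80 | 65 | 68 | 133 | 22.166667 | 480 | 36 |
| 3 | 8 | 90 | 75 | 78 | 153 | 19.125 | 720 | 64 |
| 4 | 10 | 95 | 80 | 88 | 168 | 16.8 | 950 | 100 |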


In [33]:
# Separate the inputs (X) from the target column (y)
X = df.drop("Science", axis=1)
y = df["Science"]

In [35]:
X

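| | Study_Hours | Attendance | Maths | Total_Marks | Marks_per_Hours | Study_Attendence_Interaction | Study_Hours_Squared |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 60 | 40 | 82 | 41.0 | 120 | 4 |
| 1 | 4 | 70 | 55 | 113 | 28.25 | 280 | 16 |
| 2 | 6 | 80 | 65 | 133 | 22.166667 | 480 | 36 |
| 3 | 8 | 90 | 75 | 153 | 19.125 | 720 | 64 |
| 4 | 10 | 95 | 80 | 168 | 16.8 | 950 | 100 |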


In [37]:
y

0    42
1    58
2    68
3    78
4    88
Name: Science, dtype: int64
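One thing worth checking when building `X` for this target: `Total_Marks` is computed as `Maths + Science`, and `Marks_per_Hours` is derived from `Total_Marks`, so both columns carry the target's own values. A possible way to exclude such target-derived columns is sketched below; the `leaky` list is an assumption written by hand for this dataset, not something pandas detects automatically.

```python
import pandas as pd

df = pd.DataFrame({
    "Study_Hours": [2, 4, 6, 8, 10],
    "Attendance": [60, 70, 80, 90, 95],
    "Maths": [40, 55, 65, 75, 80],
    "Science": [42, 58, 68, 78, 88],
})
df["Total_Marks"] = df["Maths"] + df["Science"]
df["Marks_per_Hours"] = df["Total_Marks"] / df["Study_Hours"]

target = "Science"
# Hand-written list of columns that were computed from the target
leaky = ["Total_Marks", "Marks_per_Hours"]

X = df.drop(columns=[target] + leaky)
y = df[target]
print(X.columns.tolist())
```

With those columns removed, the model can only learn from information that would be available before the Science mark is known.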

# Filter Methods (Statistics-Based)

## Correlation Method

In [41]:
df.corr()["Science"].sort_values(ascending=False)

Science                         1.000000
Total_Marks                     0.998981
Maths                           0.995467
Study_Hours                     0.994309
Attendance                      0.994110
Study_Attendence_Interaction    0.983804
Study_Hours_Squared             0.958102
Marks_per_Hours                -0.968737
Name: Science, dtype: float64

## Selecting Important Features

In [44]:
# Absolute correlation with the target, so strong negative relationships count too
corr = df.corr()["Science"].abs()
corr

Study_Hours                     0.994309
Attendance                      0.994110
Maths                           0.995467
Science                         1.000000
Total_Marks                     0.998981
Marks_per_Hours                 0.968737
Study_Attendence_Interaction    0.983804
Study_Hours_Squared             0.958102
Name: Science, dtype: float64

In [46]:
# Keep columns whose absolute correlation with Science exceeds 0.8
selected_features = corr[corr > 0.8].index
selected_features

Index(['Study_Hours', 'Attendance', 'Maths', 'Science', 'Total_Marks',
       'Marks_per_Hours', 'Study_Attendence_Interaction',
       'Study_Hours_Squared'],
      dtype='object')
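The selection above also returns `Science` itself, because a column always correlates perfectly with itself. A small variation, sketched here as one possible approach, drops the target before thresholding and then inspects how strongly the remaining features correlate with each other; the `target` and `feature_corr` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Study_Hours": [2, 4, 6, 8, 10],
    "Attendance": [60, 70, 80, 90, 95],
    "Maths": [40, 55, 65, 75, 80],
    "Science": [42, 58, 68, 78, 88],
})
df["Total_Marks"] = df["Maths"] + df["Science"]
df["Marks_per_Hours"] = df["Total_Marks"] / df["Study_Hours"]
df["Study_Attendence_Interaction"] = df["Study_Hours"] * df["Attendance"]
df["Study_Hours_Squared"] = df["Study_Hours"] ** 2

target = "Science"

# Drop the target's self-correlation before applying the threshold
corr = df.corr()[target].abs().drop(target)
selected_features = corr[corr > 0.8].index.tolist()
print(selected_features)

# Feature-to-feature correlations: very high values suggest redundancy
feature_corr = df[selected_features].corr().abs()
print(feature_corr.round(3))
```

On this five-row dataset every feature clears the 0.8 threshold, so the filter does not separate them; the feature-to-feature matrix is what shows that many of them are near-duplicates of one another.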