## Scenario
You work for an e-commerce company and want to predict whether a customer will make a purchase (Purchase: 1 = Yes, 0 = No). The dataset includes categorical features (like Region and Device_Type) and continuous features (like Browsing_Time and Total_Spent).

### Step 1: Generate the Dataset
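We will simulate a dataset with 1,000 rows, including categorical and numeric features. Note that `Purchase` is drawn at random, independently of every feature, so none of the features carries real signal. Keep this in mind when reading the results of each selection method below.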

In [41]:
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

data = {
    "Age": np.random.randint(18, 65, 1000),  # Numeric (integers 18-64)
    "Browsing_Time": np.random.uniform(5, 120, 1000),  # Continuous
    "Clicks": np.random.randint(1, 50, 1000),  # Numeric (integers 1-49)
    "Cart_Items": np.random.randint(0, 10, 1000),  # Numeric (integers 0-9)
    "Total_Spent": np.random.uniform(0, 500, 1000),  # Continuous
    "Discount_Code": np.random.choice([0, 1], size=1000),  # Categorical
    "Device_Type": np.random.choice(["Mobile", "Desktop", "Tablet"], size=1000),  # Categorical
    "Region": np.random.choice(["North", "South", "East", "West"], size=1000),  # Categorical
    "Purchase": np.random.choice([0, 1], size=1000),  # Target
}

# Convert to DataFrame
df = pd.DataFrame(data)

In [42]:
df.head()

Unnamed: 0,Age,Browsing_Time,Clicks,Cart_Items,Total_Spent,Discount_Code,Device_Type,Region,Purchase
0,56,42.070692,6,4,315.12361,0,Tablet,West,0
1,46,98.135561,32,4,124.191804,0,Tablet,East,0
2,32,34.283675,29,0,352.730415,0,Mobile,West,0
3,60,83.372813,43,8,213.800667,0,Tablet,West,0
4,25,92.426204,2,6,221.272757,0,Desktop,South,1


In [43]:
df.shape

(1000, 9)

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Age            1000 non-null   int32  
 1   Browsing_Time  1000 non-null   float64
 2   Clicks         1000 non-null   int32  
 3   Cart_Items     1000 non-null   int32  
 4   Total_Spent    1000 non-null   float64
 5   Discount_Code  1000 non-null   int32  
 6   Device_Type    1000 non-null   object 
 7   Region         1000 non-null   object 
 8   Purchase       1000 non-null   int32  
dtypes: float64(2), int32(5), object(2)
memory usage: 50.9+ KB


In [45]:
df['Device_Type'].value_counts()

Device_Type
Desktop    364
Mobile     334
Tablet     302
Name: count, dtype: int64

In [46]:
df['Region'].value_counts()

Region
South    260
North    256
East     249
West     235
Name: count, dtype: int64

In [47]:
# Encode categorical features as integer codes (assigned alphabetically; the codes are arbitrary labels, not an ordering)
from sklearn.preprocessing import LabelEncoder

df['Device_Type'] = LabelEncoder().fit_transform(df["Device_Type"])
df['Region'] = LabelEncoder().fit_transform(df["Region"])
print(df.head())


   Age  Browsing_Time  Clicks  Cart_Items  Total_Spent  Discount_Code  \
0   56      42.070692       6           4   315.123610              0   
1   46      98.135561      32           4   124.191804              0   
2   32      34.283675      29           0   352.730415              0   
3   60      83.372813      43           8   213.800667              0   
4   25      92.426204       2           6   221.272757              0   

   Device_Type  Region  Purchase  
0            2       3         0  
1            2       0         0  
2            1       3         0  
3            2       3         0  
4            0       2         1  


In [48]:
df.head()

Unnamed: 0,Age,Browsing_Time,Clicks,Cart_Items,Total_Spent,Discount_Code,Device_Type,Region,Purchase
0,56,42.070692,6,4,315.12361,0,2,3,0
1,46,98.135561,32,4,124.191804,0,2,0,0
2,32,34.283675,29,0,352.730415,0,1,3,0
3,60,83.372813,43,8,213.800667,0,2,3,0
4,25,92.426204,2,6,221.272757,0,0,2,1


In [49]:
df['Device_Type'].value_counts()

Device_Type
0    364
1    334
2    302
Name: count, dtype: int64

In [50]:
df['Region'].value_counts()

Region
2    260
1    256
0    249
3    235
Name: count, dtype: int64
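
Label encoding assigns integer codes in alphabetical order (here Desktop=0, Mobile=1, Tablet=2), which suggests an ordering that nominal categories do not have. As a hedged alternative, the sketch below one-hot encodes the same two columns. It uses a small hypothetical frame `demo` rather than the notebook's `df`, and the rest of the notebook does not depend on it.

```python
import pandas as pd

# Small hypothetical frame with the same two nominal columns as the notebook.
demo = pd.DataFrame({
    "Device_Type": ["Tablet", "Mobile", "Desktop", "Mobile"],
    "Region": ["West", "East", "South", "North"],
})

# One indicator column per category; no artificial ordering is introduced.
demo_encoded = pd.get_dummies(demo, columns=["Device_Type", "Region"], dtype=int)
print(demo_encoded.columns.tolist())
```

One-hot encoding increases the number of columns (seven here instead of two), which is the usual trade-off against the compactness of integer codes.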

### **Step 2: Apply Filter Methods**
### 1. Variance Threshold (For Low-Variance Features):
We will check if any features have low variance and remove them.



In [52]:
df.head()

Unnamed: 0,Age,Browsing_Time,Clicks,Cart_Items,Total_Spent,Discount_Code,Device_Type,Region,Purchase
0,56,42.070692,6,4,315.12361,0,2,3,0
1,46,98.135561,32,4,124.191804,0,2,0,0
2,32,34.283675,29,0,352.730415,0,1,3,0
3,60,83.372813,43,8,213.800667,0,2,3,0
4,25,92.426204,2,6,221.272757,0,0,2,1


In [53]:
from sklearn.feature_selection import VarianceThreshold

# Apply variance threshold
selector = VarianceThreshold(threshold = 0.01)
selector.fit(df.drop(columns=["Purchase"]))


In [54]:
low_variance_features = df.drop(columns=["Purchase"]).columns[~selector.get_support()]
print(f"Low-Variance Features: {list(low_variance_features)}")

Low-Variance Features: []


From the output of the code, we can infer the following:

1. **No Low-Variance Features**:
   - The `low_variance_features` list is empty (`[]`), which means that **none of the features in the dataset have variance below the specified threshold of `0.01`**.
   - This implies that all features have sufficient variability in their values and may potentially carry useful information for the machine learning model.

2. **Feature Variability**:
   - All features in the dataset pass the variance threshold check, so no features will be dropped based on low variance.

3. **Next Steps**:
   - Since no features were removed due to low variance, we can proceed to apply other feature selection techniques (e.g., correlation, Chi-Square Test, ANOVA F-Test) to further evaluate feature importance and relevance to the target variable (`Purchase`).

This result indicates that the dataset is well-prepared in terms of feature variance, and all features should initially be retained for further analysis.
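
To see why nothing was flagged, it helps to look at the variances themselves. The sketch below regenerates three columns with the same distributions as the dataset (the seed and the reduced column set are assumptions, so the numbers will not match the notebook exactly) and reads the fitted `variances_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)  # assumed seed, for illustration only
X_demo = pd.DataFrame({
    "Browsing_Time": rng.uniform(5, 120, 1000),
    "Cart_Items": rng.integers(0, 10, 1000),
    "Discount_Code": rng.choice([0, 1], size=1000),
})

selector_demo = VarianceThreshold(threshold=0.01).fit(X_demo)

# variances_ holds the population variance (ddof=0) of each column.
variances = pd.Series(selector_demo.variances_, index=X_demo.columns)
print(variances)
```

Even a balanced binary column has a variance near 0.25, so a threshold of `0.01` only catches near-constant columns. Because variance depends on units, the same threshold is not comparable across features measured on different scales.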

The code above applies a **Variance Threshold** filter to identify features with low variance. Let’s break it down step by step:

---

### **Code Breakdown**

#### **1. Import VarianceThreshold**
```python
from sklearn.feature_selection import VarianceThreshold
```
- `VarianceThreshold` is a filter method from Scikit-learn's `feature_selection` module.
- It identifies features with variance below a specified threshold and marks them for removal.

---

#### **2. Initialize the VarianceThreshold**
```python
selector = VarianceThreshold(threshold=0.01)
```
- `threshold=0.01`: Any feature with variance below `0.01` is considered low variance and flagged for removal. Variance depends on a feature's units, so a fixed threshold means different things for features on different scales.
- Variance measures how spread out the values in a feature are. For example:
  - If all values in a column are the same (variance = 0), it provides no useful information for predictions.
  - Features with very low variance contribute little to distinguishing between data points.

---

#### **3. Apply VarianceThreshold to the Dataset**
```python
selector.fit(df.drop(columns=["Purchase"]))
```
- `df.drop(columns=["Purchase"])`: Excludes the target column (`Purchase`) from the analysis because we only want to evaluate the input features.
- `.fit()`: Computes the variance of each feature in the dataset. The `selector` now knows which features have variance below the threshold (`0.01`).

---

#### **4. Identify Low-Variance Features**
```python
low_variance_features = df.drop(columns=["Purchase"]).columns[~selector.get_support()]
```
- `selector.get_support()`: Returns a Boolean array indicating which features are **kept** (True) based on the threshold.
  - Example:
    - If there are 5 features and 3 pass the threshold, `get_support()` returns `[True, True, False, True, False]`.
- `~selector.get_support()`: The `~` operator inverts the Boolean array, marking the features that **do not pass** the threshold (low variance features).
  - Example:
    - Inverted array: `[False, False, True, False, True]`.
- `df.drop(columns=["Purchase"]).columns[~selector.get_support()]`: Selects the column names where the inverted mask is `True`, that is, the features that failed the threshold (low-variance features).

---

#### **5. Print Low-Variance Features**
```python
print(f"Low-Variance Features: {list(low_variance_features)}")
```
- `list(low_variance_features)`: Converts the column names of low-variance features into a list.
- Prints the names of the features with variance below `0.01`.

---

### **What Does This Do?**
This code:
1. Computes the variance of each feature in the dataset.
2. Flags features with variance below `0.01`.
3. Identifies and outputs the names of these low-variance features, which you can choose to drop from the dataset.

---

### **Why Is This Important?**
Low-variance features often:
- Provide little to no meaningful information.
- Increase the complexity of the model without improving its performance.
- Should be removed to simplify the dataset and improve efficiency.

---

### **Example**
Imagine a dataset:

| Feature1 | Feature2 | Feature3 | Purchase |
|----------|----------|----------|----------|
| 0.01     | Mobile   | 1        | 1        |
| 0.01     | Desktop  | 1        | 0        |
| 0.01     | Tablet   | 1        | 1        |

- `Feature1`: All values are identical (`0.01`), so its variance is `0`.
- `Feature2`: Text values; it would need to be numerically encoded before `VarianceThreshold` could process it.
- `Feature3`: All values are the same (`1`), so its variance is `0`.

After encoding `Feature2` and running the code, the output would be:
```plaintext
Low-Variance Features: ['Feature1', 'Feature3']
```

Constant features like these can be dropped because they cannot help distinguish one row from another.
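
A runnable version of this example is sketched below. `Feature2` is replaced by a hypothetical integer-coded column, `Feature2_code`, because `VarianceThreshold` only accepts numeric input.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy features from the table above; Feature2_code is a hypothetical
# integer encoding of Mobile/Desktop/Tablet.
X_toy = pd.DataFrame({
    "Feature1": [0.01, 0.01, 0.01],
    "Feature2_code": [0, 1, 2],
    "Feature3": [1, 1, 1],
})

selector_toy = VarianceThreshold(threshold=0.01)
selector_toy.fit(X_toy)

# Invert the "kept" mask to list the features that failed the threshold.
low_variance_toy = list(X_toy.columns[~selector_toy.get_support()])
print(f"Low-Variance Features: {low_variance_toy}")
```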

### 2. Correlation Coefficient (For Continuous Features)

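We calculate the Pearson correlation between each feature and the target (`Purchase`). Note that `df.corr()` below also includes the label-encoded categorical columns. For nominal features such as `Device_Type` and `Region`, the integer codes are arbitrary, so their correlation values are not meaningful.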

In [74]:
# Calculate correlation matrix
correlation_matrix = df.corr()
correlation_matrix

Unnamed: 0,Age,Browsing_Time,Clicks,Cart_Items,Total_Spent,Discount_Code,Device_Type,Region,Purchase
Age,1.0,-0.043343,-0.060234,0.011149,0.060819,0.007966,0.02324,-0.028701,-0.00339
Browsing_Time,-0.043343,1.0,0.015501,0.025316,0.041035,0.02566,0.051577,-0.02117,-0.048452
Clicks,-0.060234,0.015501,1.0,0.024454,0.07263,-0.021259,0.029652,0.048357,0.002869
Cart_Items,0.011149,0.025316,0.024454,1.0,-0.008352,-0.0288,-0.005198,-0.002173,0.043563
Total_Spent,0.060819,0.041035,0.07263,-0.008352,1.0,0.031412,-0.008481,-0.01352,-0.01747
Discount_Code,0.007966,0.02566,-0.021259,-0.0288,0.031412,1.0,-0.005969,-0.013494,-0.041132
Device_Type,0.02324,0.051577,0.029652,-0.005198,-0.008481,-0.005969,1.0,-0.01579,-0.032892
Region,-0.028701,-0.02117,0.048357,-0.002173,-0.01352,-0.013494,-0.01579,1.0,0.013941
Purchase,-0.00339,-0.048452,0.002869,0.043563,-0.01747,-0.041132,-0.032892,0.013941,1.0


In [76]:
# Extract correlation with the target
correlation_with_target = correlation_matrix["Purchase"].sort_values(ascending=False)
print("Correlation with Target:\n", correlation_with_target)

Correlation with Target:
 Purchase         1.000000
Cart_Items       0.043563
Region           0.013941
Clicks           0.002869
Age             -0.003390
Total_Spent     -0.017470
Device_Type     -0.032892
Discount_Code   -0.041132
Browsing_Time   -0.048452
Name: Purchase, dtype: float64


From the output of the correlation analysis, we can infer the following:

### **Key Observations:**
1. **Purchase (Target Variable)**:
   - The correlation of `Purchase` with itself is always 1 (as expected, because it’s perfectly correlated with itself).

2. **Cart_Items**:
   - Has the **highest positive correlation (0.043563)** with `Purchase` among the features.
   - This suggests that customers adding more items to their cart might have a slight tendency to make a purchase. However, the correlation is very weak (close to 0), so it’s not a strong predictor.

3. **Region** and **Clicks**:
   - Have very weak positive correlations with `Purchase` (0.013941 and 0.002869, respectively).
   - These features might not have much influence on predicting whether a customer makes a purchase.

4. **Age, Total_Spent, Device_Type, Discount_Code, Browsing_Time**:
   - All have **negative correlations** with `Purchase`, but every value is so close to 0 that the sign should not be read as a real effect. For `Device_Type`, the sign is meaningless in any case because the label codes are arbitrary.
   - Among these, `Browsing_Time` (-0.048452) has the largest magnitude, yet it is still far too weak to conclude that longer browsing reduces the likelihood of a purchase.

---

### **Insights from Correlation Values:**
- **Weak Correlations Overall**:
   - None of the features have a strong correlation (close to 1 or -1) with the target variable `Purchase`. This means that individually, these features may not be strong predictors of whether a customer makes a purchase.
   
- **Potentially Useful Features**:
   - While the correlations are weak, `Cart_Items` has the highest positive correlation, so it might still add some predictive power when used in combination with other features.

---

### **Next Steps:**
1. **Further Feature Selection**:
   - Combine this correlation analysis with other feature selection methods (e.g., Chi-Square Test, ANOVA F-Test) to assess feature relevance in different ways.
   
2. **Modeling and Interaction Effects**:
   - Sometimes features with weak individual correlations can become important when combined with others (interaction effects). Consider including features like `Cart_Items` and testing their importance in the model.

3. **Feature Engineering**:
   - Create new features (e.g., `Cart_Items * Discount_Code`) to capture potential interactions not visible in the correlation matrix.
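
The interaction feature mentioned above can be built with a single column product. This is a minimal sketch on a small hypothetical frame; the column name `Cart_x_Discount` is an assumption chosen for illustration.

```python
import pandas as pd

# Small hypothetical frame with the two columns involved in the interaction.
demo = pd.DataFrame({
    "Cart_Items": [4, 0, 8, 6],
    "Discount_Code": [0, 1, 1, 0],
})

# Interaction term: cart size counts only when a discount code was used.
demo["Cart_x_Discount"] = demo["Cart_Items"] * demo["Discount_Code"]
print(demo)
```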

---

### **Conclusion**:
- The correlation values suggest weak relationships between the individual features and `Purchase`.
- While `Cart_Items` shows the most promise, the weak correlations highlight the need for further analysis or model-based feature selection methods.
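
To judge how weak these values are, they can be compared with the size of correlation that chance alone produces. For 1,000 rows and a true correlation of 0, roughly 95% of sample correlations fall within about ±1.96/√n. The sketch below applies that approximate cutoff to the values reported above; it is a rule of thumb, not an exact test.

```python
import numpy as np
import pandas as pd

n = 1000  # number of rows in the dataset

# Correlations with Purchase as reported in the output above.
reported_r = pd.Series({
    "Cart_Items": 0.043563, "Region": 0.013941, "Clicks": 0.002869,
    "Age": -0.003390, "Total_Spent": -0.017470, "Device_Type": -0.032892,
    "Discount_Code": -0.041132, "Browsing_Time": -0.048452,
})

# Approximate two-sided 5% cutoff for |r| when the true correlation is 0.
cutoff = 1.96 / np.sqrt(n)
exceeds = reported_r.abs() > cutoff
print(f"cutoff = {cutoff:.3f}")
print(exceeds)
```

None of the reported correlations exceeds the cutoff of about 0.062, which is consistent with a target that was generated independently of the features.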

### 3. Chi-Square Test (For Categorical Features)

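We use the Chi-Square test to measure the dependency between categorical features (`Discount_Code`, `Device_Type`, `Region`) and the target. Note that `sklearn.feature_selection.chi2` expects non-negative features such as counts or one-hot indicators. Applied to label-encoded nominal columns like `Device_Type` and `Region`, its statistic depends on the arbitrary integer codes, so treat those two rows as illustrative only.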

In [84]:
from sklearn.feature_selection import chi2

# Select categorical features:
categorical_features = df[["Discount_Code", "Device_Type", "Region"]]
chi_scores, p_values = chi2(categorical_features, df["Purchase"])

chi_square_results = pd.DataFrame({"Feature":categorical_features.columns, "Chi-Square": chi_scores, "P-Value": p_values})
print(chi_square_results)

         Feature  Chi-Square   P-Value
0  Discount_Code    0.884823  0.346884
1    Device_Type    0.763736  0.382162
2         Region    0.159783  0.689356


From the results of the Chi-Square test:

### **Understanding the Results**
1. **Chi-Square Test**:
   - The Chi-Square test evaluates the dependency between categorical features (`Discount_Code`, `Device_Type`, and `Region`) and the target variable (`Purchase`).
   - Higher Chi-Square values indicate stronger dependency (more influence on the target).
   - P-values tell us the statistical significance of this dependency. A low p-value (< 0.05) suggests that the feature is significantly related to the target.

2. **Feature Analysis**:
   - **Discount_Code**:
     - Chi-Square = 0.884823
     - P-value = 0.346884
     - The high p-value suggests that `Discount_Code` is not significantly associated with `Purchase`.
   - **Device_Type**:
     - Chi-Square = 0.763736
     - P-value = 0.382162
     - Again, the high p-value indicates no significant association between `Device_Type` and `Purchase`.
   - **Region**:
     - Chi-Square = 0.159783
     - P-value = 0.689356
     - `Region` has an even higher p-value, confirming no strong dependency with `Purchase`.

---

### **What Can Be Inferred?**
1. **No Significant Dependency**:
   - None of the tested categorical features (`Discount_Code`, `Device_Type`, `Region`) show a significant relationship with the target variable `Purchase` based on the high p-values (all above 0.05).

2. **Feature Exclusion**:
   - These categorical features might not be valuable for predicting `Purchase` and can potentially be excluded from the model. However, it's worth checking interaction effects or combining them with other features before outright removal.

3. **Next Steps**:
   - Further analysis with other feature selection methods (e.g., wrapper or embedded methods) could provide additional insights.
   - Consider creating interaction terms or conducting feature engineering to identify hidden patterns not captured by the Chi-Square test.

---

### **Conclusion**:
Based on the Chi-Square results, none of the categorical features appear to have a strong or statistically significant relationship with `Purchase`. This indicates they might not contribute much to the predictive power of the model but should be further analyzed before exclusion.
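
`sklearn.feature_selection.chi2` works from per-class sums of the feature values, so for a label-encoded nominal column the result changes with the integer codes. A hedged alternative is to test independence on the contingency table of raw categories with `scipy.stats.chi2_contingency`. The sketch below uses regenerated data (the seed is an assumption), so its numbers will differ from the notebook's.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)  # assumed seed, for illustration only
demo = pd.DataFrame({
    "Region": rng.choice(["North", "South", "East", "West"], size=1000),
    "Purchase": rng.choice([0, 1], size=1000),
})

# Observed counts for every (Region, Purchase) combination.
table = pd.crosstab(demo["Region"], demo["Purchase"])

# Pearson chi-square test of independence on the contingency table.
stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}, dof = {dof}")
```

With four regions and two target classes the test has (4 - 1) × (2 - 1) = 3 degrees of freedom, and the result does not depend on how the categories are coded.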

Higher Chi-Square values generally indicate a stronger dependency between the feature and the target variable, so it is worth clarifying how the statistic and the P-value relate in these results.

---

### **Key Points to Understand**

1. **Chi-Square Value**:
   - A larger Chi-Square value indicates a larger departure from independence between the feature and the target (`Purchase`).
   - Whether a value counts as "large" depends on the degrees of freedom. With 1 degree of freedom, as in this test, a value of about 3.84 is needed for significance at the 5% level. All three statistics here are below 1.

2. **P-Value**:
   - The **P-value** indicates whether the observed Chi-Square value is statistically significant (i.e., unlikely to occur by chance).
   - A low P-value (< 0.05) means the feature's association with the target is statistically significant.
   - For a fixed number of degrees of freedom, the P-value is determined entirely by the statistic: a larger Chi-Square value always gives a smaller P-value.

---

### **Interpreting the Results**

#### **1. Discount_Code**
- **Chi-Square = 0.884823**: The largest of the three, but well below the 5% critical value of about 3.84.
- **P-value = 0.346884**: Indicates that the relationship between `Discount_Code` and `Purchase` is **not statistically significant**. In other words, while `Discount_Code` may show some association with `Purchase`, this relationship might be due to random chance.

#### **2. Device_Type**
- **Chi-Square = 0.763736**: Also well below the critical value.
- **P-value = 0.382162**: Again, this suggests the association between `Device_Type` and `Purchase` is **not statistically significant**.

#### **3. Region**
- **Chi-Square = 0.159783**: A low value.
- **P-value = 0.689356**: Indicates no significant relationship between `Region` and `Purchase`.

---

### **How to Think About This?**
- The Chi-Square value and the P-value carry the same information for a given number of degrees of freedom. A statistic cannot be "high" while its P-value is also high; the values here are simply small.
- **P-values are crucial for deciding statistical significance**. Without a low P-value, you cannot confidently say the feature has a real dependency on the target variable.

---

### **What Should You Do Next?**
1. **Do Not Rank Features on These Scores Alone**:
   - `Discount_Code` and `Device_Type` have somewhat larger Chi-Square values than `Region`, but all three are consistent with no association, so the test gives no basis for preferring one over another.
   - If you keep them, do so in order to test them with other techniques (e.g., feature importance in tree-based models), not because this test supports them.

2. **Consider Removing Region**:
   - `Region` has the lowest Chi-Square value and the highest P-value, although by this test it is not meaningfully different from the other two.

3. **Check Feature Interactions**:
   - Sometimes, a feature might not show a significant relationship individually but could be important in interaction with other features. For example, `Discount_Code` might interact with `Device_Type` (e.g., discounts might work better on mobile devices).

4. **Model Validation**:
   - Ultimately, use your model’s performance metrics (e.g., accuracy, AUC-ROC) to validate the usefulness of these features.

---

### **Conclusion**
- A high Chi-Square value indicates dependency and always comes with a low P-value. Here all three statistics are small and none is statistically significant.
- The test gives no evidence for keeping or dropping any of the three features; validate them through model performance instead.
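
The link between the statistic and the P-value can be checked directly. The sketch below assumes 1 degree of freedom (two target classes) and uses SciPy's chi-square distribution to recover the reported P-values from the reported statistics.

```python
from scipy.stats import chi2 as chi2_dist

# Statistics reported by the test above; the target has two classes, so df = 1.
reported = {"Discount_Code": 0.884823, "Device_Type": 0.763736, "Region": 0.159783}

# Statistic needed for significance at the 5% level with 1 degree of freedom.
critical_value = chi2_dist.ppf(0.95, df=1)
print(f"critical value = {critical_value:.3f}")

# The P-value is the upper-tail probability of the statistic.
p_from_stat = {name: chi2_dist.sf(stat, df=1) for name, stat in reported.items()}
for name, p in p_from_stat.items():
    print(f"{name}: p = {p:.6f}")
```

Each recovered P-value matches the table above, which confirms that the statistic and the P-value are two views of the same result.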

### 4. ANOVA F-Test (For Continuous Features)

The ANOVA F-test evaluates the relationship between continuous features and the categorical target (Purchase).

In [91]:
from sklearn.feature_selection import f_classif

# Select continuous features
continuous_features = df[["Age", "Browsing_Time", "Clicks", "Cart_Items", "Total_Spent"]]

# Apply ANOVA F-Test
f_scores, p_values = f_classif(continuous_features, df['Purchase'])

anova_results = pd.DataFrame({"Feature": continuous_features.columns, "F-Score": f_scores, "P-Value":p_values})
print(anova_results)


         Feature   F-Score   P-Value
0            Age  0.011466  0.914746
1  Browsing_Time  2.348455  0.125724
2         Clicks  0.008213  0.927807
3     Cart_Items  1.897531  0.168664
4    Total_Spent  0.304670  0.581093


From the results of the **ANOVA F-Test**, we can draw the following inferences:

---

### **Key Observations**

1. **F-Score**:
   - The F-Score is the ratio of the variance between the class means of a feature to the variance within the classes. A higher F-Score indicates that the feature's mean differs more clearly between the classes of the target (`Purchase`).

2. **P-Value**:
   - The P-value indicates whether the F-Score is statistically significant. A low P-value (< 0.05) suggests the feature is significantly related to the target variable.
   - If the P-value is high (> 0.05), it implies the feature does not significantly contribute to predicting the target.

---

### **Feature Analysis**

#### **1. Age**
   - **F-Score = 0.011466**
   - **P-Value = 0.914746**
   - The very low F-Score and high P-value suggest that `Age` has no significant relationship with `Purchase`.

#### **2. Browsing_Time**
   - **F-Score = 2.348455**
   - **P-Value = 0.125724**
   - The F-Score is relatively higher, but the P-value is above 0.05, indicating that `Browsing_Time` does not significantly differentiate between customers who make a purchase and those who don’t.

#### **3. Clicks**
   - **F-Score = 0.008213**
   - **P-Value = 0.927807**
   - The low F-Score and high P-value suggest that `Clicks` is not a significant predictor of `Purchase`.

#### **4. Cart_Items**
   - **F-Score = 1.897531**
   - **P-Value = 0.168664**
   - While `Cart_Items` has a higher F-Score compared to some other features, the P-value is still above 0.05, meaning it is not statistically significant.

#### **5. Total_Spent**
   - **F-Score = 0.304670**
   - **P-Value = 0.581093**
   - The low F-Score and high P-value suggest that `Total_Spent` is not a significant predictor of `Purchase`.

---

### **Overall Inferences**
1. **No Feature is Statistically Significant**:
   - None of the continuous features (`Age`, `Browsing_Time`, `Clicks`, `Cart_Items`, `Total_Spent`) have P-values below 0.05, indicating that they do not significantly contribute to predicting `Purchase`.

2. **Relative Importance**:
   - While not statistically significant, `Browsing_Time` and `Cart_Items` have relatively higher F-Scores, suggesting they might still carry some predictive power when combined with other features or through feature engineering.

3. **Irrelevant Features**:
   - `Age`, `Clicks`, and `Total_Spent` have very low F-Scores and high P-values, suggesting they contribute very little and can potentially be dropped.

---

### **Next Steps**
1. **Feature Engineering**:
   - Create interaction terms (e.g., `Cart_Items * Discount_Code`) or transformations (e.g., log of `Browsing_Time`) to uncover hidden patterns.
2. **Model-Based Feature Selection**:
   - Use tree-based models (e.g., Random Forest) to evaluate feature importance in a more flexible way, as they can capture non-linear relationships and interactions.
3. **Validation Through Modeling**:
   - Validate the results by training a machine learning model and analyzing its performance with and without these features.
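
Model-based importance can be sketched with a random forest. The data below is regenerated with an assumed seed and a reduced column set, and the target is random as in the notebook, so the importances only illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)  # assumed seed, for illustration only
X_demo = pd.DataFrame({
    "Browsing_Time": rng.uniform(5, 120, 1000),
    "Cart_Items": rng.integers(0, 10, 1000),
    "Discount_Code": rng.choice([0, 1], size=1000),
})
y_demo = rng.choice([0, 1], size=1000)  # random target, as in the notebook

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances tend to favor features with many distinct values, so a continuous column can rank first on a random target without being predictive. Permutation importance on held-out data is a common cross-check.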

---

### **Conclusion**
None of the continuous features are statistically significant predictors of `Purchase` individually. However, `Browsing_Time` and `Cart_Items` might still hold some value and should be retained for further investigation, while `Age`, `Clicks`, and `Total_Spent` are likely candidates for removal.
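
With a binary target, the ANOVA F-test and the Pearson correlation test are the same test: for n rows, F = r²(n − 2) / (1 − r²). The sketch below checks this using the correlations and F-scores copied from the outputs above, which is why the two methods rank the continuous features identically.

```python
import numpy as np
import pandas as pd

n = 1000  # number of rows

# Correlations with Purchase and F-scores, both copied from the outputs above.
r = pd.Series({"Age": -0.003390, "Browsing_Time": -0.048452, "Clicks": 0.002869,
               "Cart_Items": 0.043563, "Total_Spent": -0.017470})
reported_f = pd.Series({"Age": 0.011466, "Browsing_Time": 2.348455, "Clicks": 0.008213,
                        "Cart_Items": 1.897531, "Total_Spent": 0.304670})

# For a two-class target: F = r^2 * (n - 2) / (1 - r^2).
f_from_r = r**2 * (n - 2) / (1 - r**2)
print(pd.DataFrame({"F from r": f_from_r, "F reported": reported_f}))
```

The two columns agree up to rounding, so running both the correlation filter and the ANOVA F-test on a binary target adds no independent evidence.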

### Summary
- Variance Threshold: No feature fell below the `0.01` threshold, so none was removed.
- Correlation Analysis: No feature showed a strong correlation with the target (all values were within ±0.05).
- Chi-Square Test: None of the categorical features showed a statistically significant dependency on the target.
- ANOVA F-Test: None of the continuous features was significantly related to the target (all P-values above 0.05).

## Using Wrapper Method for feature selection

Wrapper methods select features based on their impact on a machine learning model's performance. Unlike filter methods, they evaluate subsets of features by training models and measuring performance, which makes them computationally more expensive but often more accurate for the specific model used.

Here’s how we can use wrapper methods step by step:

### 1. Forward Selection:
Forward selection starts with no features and adds features one by one, selecting the feature that improves the model's performance the most at each step.

In [97]:
df.head()

Unnamed: 0,Age,Browsing_Time,Clicks,Cart_Items,Total_Spent,Discount_Code,Device_Type,Region,Purchase
0,56,42.070692,6,4,315.12361,0,2,3,0
1,46,98.135561,32,4,124.191804,0,2,0,0
2,32,34.283675,29,0,352.730415,0,1,3,0
3,60,83.372813,43,8,213.800667,0,2,3,0
4,25,92.426204,2,6,221.272757,0,0,2,1


In [100]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X = df.drop(columns=["Purchase"])
y = df['Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Logistic Regression model
model = LogisticRegression(max_iter=500)

# Apply forward selection
sfs_forward = SequentialFeatureSelector(model, n_features_to_select = "auto", direction="forward", cv=5)
sfs_forward.fit(X_train, y_train)

# selected features
forward_selected_features = X_train.columns[sfs_forward.get_support()]
print("Forward Selected Features:", list(forward_selected_features))

Forward Selected Features: ['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']


### **Explanation of Code and Results**

#### **1. Import Required Libraries**
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```
- **SequentialFeatureSelector**: A wrapper method for feature selection that adds or removes features sequentially to find the best subset.
- **LogisticRegression**: A machine learning algorithm used here to evaluate the importance of features.
- **train_test_split**: Splits the dataset into training and testing sets.

---

#### **2. Split the Dataset into Features and Target**
```python
X = df.drop(columns=["Purchase"])
y = df["Purchase"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
- **`X`**: Contains all features (independent variables) by dropping the target column (`Purchase`).
- **`y`**: Contains the target column (`Purchase`).
- **`train_test_split`**:
  - Splits the dataset into:
    - Training set: 70% of the data (used for feature selection and model training).
    - Testing set: 30% of the data (reserved for validation).
  - **`random_state=42`** ensures reproducibility (same split every time).

---

#### **3. Create a Logistic Regression Model**
```python
model = LogisticRegression(max_iter=500)
```
- **LogisticRegression**:
  - A classification model used here to evaluate the performance of different feature subsets.
  - **`max_iter=500`** increases the number of iterations for convergence (default is 100).

---

#### **4. Apply Forward Selection**
```python
sfs_forward = SequentialFeatureSelector(model, n_features_to_select="auto", direction="forward", cv=5)
sfs_forward.fit(X_train, y_train)
```
- **`SequentialFeatureSelector`**:
  - A wrapper method for feature selection that uses cross-validation to evaluate feature subsets.
  - **`direction="forward"`**:
    - Starts with no features and adds one feature at a time.
    - At each step, it selects the feature that improves model performance the most (based on the `model` provided).
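  - **`n_features_to_select="auto"`**:
    - With the default `tol=None`, this selects half of the features (4 of the 8 here); it does not search for an optimal number. Setting `tol` makes selection stop once the cross-validated score improves by less than `tol`.
  - **`cv=5`**:
    - Each candidate feature subset is scored with 5-fold cross-validation.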

---

#### **5. Extract Selected Features**
```python
forward_selected_features = X_train.columns[sfs_forward.get_support()]
print("Forward Selected Features:", list(forward_selected_features))
```
- **`sfs_forward.get_support()`**:
  - Returns a Boolean array indicating which features were selected (`True` for selected features).
- **`X_train.columns[sfs_forward.get_support()]`**:
  - Extracts the names of the features corresponding to `True` values in the Boolean array.

---

### **Results**
```plaintext
Forward Selected Features: ['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']
```

#### **Interpretation of Results**
1. **Selected Features**:
   - The forward selection process selected four features, half of the eight available, which is what `n_features_to_select="auto"` does with the default `tol=None`:
     - **Cart_Items**: Number of items added to the cart.
     - **Discount_Code**: Whether a discount code was used.
     - **Device_Type**: The type of device (e.g., mobile, desktop).
     - **Region**: The customer's region (e.g., North, South).

2. **Why These Features**:
   - At each step, the selector added the feature that gave the best cross-validated score when combined with those already chosen.
   - `Purchase` was generated at random, independently of every feature, so the score differences that decided the ranking are noise. Plausible stories (for example, that customers with fuller carts are more likely to purchase) are not supported by this data.

3. **Subset Size**:
   - Fewer features mean a simpler model that trains faster. Whether predictive performance is retained should be checked on the held-out test set (`X_test`, `y_test`).

---

### **Why Forward Selection?**
- Forward selection is useful because it:
  - Iteratively builds the best feature subset by adding features one at a time.
  - Avoids evaluating all possible subsets (which would be computationally expensive).

This method aims to keep only the features that most improve the model's cross-validated score, which can help reduce overfitting and improve interpretability.

### 2. Backward Elimination

Backward elimination starts with all features and removes the least important feature one by one, based on its impact on model performance.

In [106]:
# Apply Backward Elimination
sfs_backward = SequentialFeatureSelector(model, n_features_to_select="auto", direction="backward", cv=5)
sfs_backward.fit(X_train, y_train)

# Selected features
backward_selected_features = X_train.columns[sfs_backward.get_support()]
print("Backward Selected Features:", list(backward_selected_features))

Backward Selected Features: ['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']


### **Explanation of the Code and Inference**

#### **What Does the Code Do?**

1. **Backward Elimination**:
   - Starts with all features and removes the least important one-by-one, based on the model's performance.
   - The process continues until it finds the best subset of features.

2. **Steps in Code**:
   - **SequentialFeatureSelector**:
     - Uses a logistic regression model to evaluate feature importance.
     - **`direction="backward"`** specifies that it starts with all features and eliminates them sequentially.
     - **`n_features_to_select="auto"`** with the default `tol=None` keeps half of the features (4 of 8); each candidate subset is scored with 5-fold cross-validation (`cv=5`).
   - **`sfs_backward.get_support()`**:
     - Returns a Boolean mask of the features retained by the backward elimination process.
   - **Selected Features**:
     - Extracted by filtering `X_train.columns` with the Boolean mask.

---

#### **Result Output**
```plaintext
Backward Selected Features: ['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']
```

---

### **Inferences**

1. **Selected Features**:
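   - The backward elimination process retained the following features. The rationales are plausible readings rather than verified effects, because `Purchase` in this simulated dataset is drawn independently of every feature: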
     - **Cart_Items**: Likely significant because the number of items in the cart may indicate the likelihood of making a purchase.
     - **Discount_Code**: Indicates whether the use of discounts influences purchases.
     - **Device_Type**: Shows if device preferences (e.g., mobile, desktop) affect purchasing behavior.
     - **Region**: Highlights if customer location plays a role in purchase behavior.

2. **Consistency**:
   - The selected features from backward elimination are the same as those selected by forward selection. Agreement between two greedy searches suggests the subset is stable, but it does not by itself prove that these features are important.

3. **Importance of Features**:
   - Within this search, these features gave the best cross-validated score for the logistic regression model.

4. **Optimal Subset**:
   - The selected features are a compact candidate subset for training the model; whether performance is maintained should be checked against the full feature set.

---

### **Next Steps**
1. **Model Training**:
   - Train a logistic regression model or other machine learning algorithms using only the selected features to evaluate the impact on performance.
2. **Validation**:
   - Compare the performance (e.g., accuracy, F1-score, AUC-ROC) of models trained on the selected features versus all features.
3. **Comparison with Other Methods**:
   - Validate these results against Recursive Feature Elimination (RFE) to ensure consistency across wrapper methods.
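
The validation step can be sketched as a cross-validated comparison between the full feature set and a selected subset. The example below is a minimal illustration on synthetic data; `selected_idx` is a hypothetical subset of column indices, and the data is not the notebook's `X_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
selected_idx = [0, 2, 5, 7]  # hypothetical column indices of a selected subset

clf = LogisticRegression(max_iter=500)
results = {}
for metric in ("accuracy", "f1", "roc_auc"):
    # Mean 5-fold score with all features vs. only the selected columns
    full = cross_val_score(clf, X_demo, y_demo, cv=5, scoring=metric).mean()
    subset = cross_val_score(
        clf, X_demo[:, selected_idx], y_demo, cv=5, scoring=metric
    ).mean()
    results[metric] = (full, subset)
    print(f"{metric:>8}: all features = {full:.3f}, selected subset = {subset:.3f}")
```

In the notebook, the same comparison would use `X_train[list(backward_selected_features)]` in place of the indexed array. Since the features were selected on the training data, a held-out test set gives a fairer final comparison.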

---

### **Conclusion**
Backward elimination identified the same feature set as forward selection (`Cart_Items`, `Discount_Code`, `Device_Type`, and `Region`), which makes these four features reasonable candidates for predicting the target (`Purchase`). They can be used for model building, subject to the validation steps above.

### 3. Recursive Feature Elimination (RFE)
RFE works by recursively removing the least important feature, based on the model’s feature importance scores, until the desired number of features is reached.

In [110]:
from sklearn.feature_selection import RFE

# Apply Recursive Feature Elimination
rfe_selector = RFE(estimator=model, n_features_to_select=5)
rfe_selector.fit(X_train, y_train)

# selected features:
rfe_selected_features = X_train.columns[rfe_selector.support_]
print("RFE Selected Features:", list(rfe_selected_features))

RFE Selected Features: ['Age', 'Cart_Items', 'Discount_Code', 'Device_Type', 'Region']


In [113]:
# Import RFE
from sklearn.feature_selection import RFE

# Apply Recursive Feature Elimination
rfe_selector = RFE(estimator=model, n_features_to_select=5)
rfe_selector.fit(X_train, y_train)

# Extract rankings
feature_ranks = pd.DataFrame({
    "Feature": X_train.columns,
    "Rank": rfe_selector.ranking_
}).sort_values(by="Rank")

# Display ranked features
print("Feature Rankings:")
print(feature_ranks)


Feature Rankings:
         Feature  Rank
0            Age     1
3     Cart_Items     1
5  Discount_Code     1
6    Device_Type     1
7         Region     1
1  Browsing_Time     2
2         Clicks     3
4    Total_Spent     4


### **Explanation of Code and Output**

#### **Code Breakdown**

##### **1. Import the Recursive Feature Elimination (RFE) Module**
```python
from sklearn.feature_selection import RFE
```
- **RFE**:
  - A wrapper method for feature selection.
  - Recursively removes the least important features based on the model's feature importance until the desired number of features is reached.

---

##### **2. Apply Recursive Feature Elimination**
```python
rfe_selector = RFE(estimator=model, n_features_to_select=5)
rfe_selector.fit(X_train, y_train)
```
- **`estimator=model`**:
  - Specifies the model whose fitted weights rank the features. Here, it's a logistic regression model (`model`), so RFE ranks features by the size of its coefficients (`coef_`).
- **`n_features_to_select=5`**:
  - The number of features to retain. RFE will keep the top 5 most important features based on the logistic regression model.
- **`fit(X_train, y_train)`**:
  - Fits the RFE algorithm to the training dataset (`X_train`, `y_train`), iteratively eliminating features until 5 remain.
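
The elimination loop that `fit` performs can be written out by hand. The sketch below is a simplified illustration of the idea for a binary logistic regression with one feature removed per step, run on synthetic stand-in data; it is not scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 8 features (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
while len(remaining) > 5:
    clf = LogisticRegression(max_iter=500).fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(np.abs(clf.coef_[0])))  # position of smallest |coefficient|
    remaining.pop(weakest)  # drop it, then refit on the rest

# Library version with the same estimator settings, for comparison
rfe = RFE(estimator=LogisticRegression(max_iter=500), n_features_to_select=5)
rfe.fit(X_demo, y_demo)
print("Manual loop:", remaining)
print("sklearn RFE:", list(np.flatnonzero(rfe.support_)))
```

Note that no cross-validation score is involved: each round looks only at the fitted coefficients, which is the main difference from `SequentialFeatureSelector`.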

---

##### **3. Extract Selected Features**
```python
rfe_selected_features = X_train.columns[rfe_selector.support_]
print("RFE Selected Features:", list(rfe_selected_features))
```
- **`rfe_selector.support_`**:
  - A Boolean mask where `True` indicates the selected features, and `False` indicates the removed features.
- **`X_train.columns[rfe_selector.support_]`**:
  - Filters the column names in `X_train` to include only the selected features.
- **Output**:
  - Displays the names of the top 5 selected features based on RFE.

---

#### **Output**
```plaintext
RFE Selected Features: ['Age', 'Cart_Items', 'Discount_Code', 'Device_Type', 'Region']
```

---

### **Interpretation of the Output**

1. **Selected Features**:
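   - **'Age'**: RFE included `Age` in the top 5 features. RFE was asked for five features (`n_features_to_select=5`), while forward and backward selection kept four, so one extra feature had to be retained, and `Age` was it. This alone does not show that `Age` is predictive of the target (`Purchase`).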
   - **'Cart_Items'**: Consistently selected across all methods, suggesting it’s a strong predictor of `Purchase`.
   - **'Discount_Code'**: Indicates the importance of discounts in influencing purchase behavior.
   - **'Device_Type'**: Suggests that the type of device used (mobile, desktop, etc.) impacts purchasing decisions.
   - **'Region'**: Location may play a role in predicting purchasing behavior.

2. **Differences from Forward/Backward Selection**:
   - **Addition of 'Age'**:
     - RFE was configured to keep five features, one more than forward and backward selection returned, so it necessarily retains an extra feature. `Age` ranked ahead of `Browsing_Time`, `Clicks`, and `Total_Spent` in the coefficient-based ranking.
   - The rest of the selected features are consistent with previous methods.

3. **Strength of RFE**:
   - RFE ranks features by the estimator's own importance measure (coefficient magnitude for logistic regression) and refits after each elimination, so the ranking is updated as features are dropped. Because coefficient size depends on feature scale, the ranking is most meaningful when features are standardized. A linear estimator does not model interactions, so this ranking says nothing about whether `Age` interacts with other features.
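
One factor worth checking when reading an RFE ranking from a linear model is feature scale, since a feature measured in large units gets a small coefficient. The sketch below builds a synthetic dataset (all names and values are illustrative assumptions) in which only one feature drives the target, stores that feature on a scale 100 times larger than the others, and compares the ranking before and after standardization. Whether scale explains the ranking shown above is not established here.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)       # the only feature that drives y
noise = rng.normal(size=(n, 3))   # three pure-noise features
y_demo = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

# Store the informative feature on a much larger scale (e.g. currency units)
X_raw = np.column_stack([signal * 100.0, noise])
names = ["signal_x100", "noise_1", "noise_2", "noise_3"]


def rfe_ranking(X):
    """Return RFE ranks (1 = kept) when only a single feature is retained."""
    rfe = RFE(LogisticRegression(max_iter=2000), n_features_to_select=1)
    return rfe.fit(X, y_demo).ranking_


rank_raw = rfe_ranking(X_raw)
rank_scaled = rfe_ranking(StandardScaler().fit_transform(X_raw))
print("Ranking on raw features:   ", dict(zip(names, rank_raw.tolist())))
print("Ranking on scaled features:", dict(zip(names, rank_scaled.tolist())))
```

If the two rankings differ, the raw ranking reflects units as much as relevance, and standardizing before RFE is the safer choice.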

---

### **Conclusion**
- RFE selected 5 features: **'Age', 'Cart_Items', 'Discount_Code', 'Device_Type', 'Region'**.
- `Age` was not selected by forward or backward methods; its presence here follows from requesting five features rather than four, so it should be validated before being treated as relevant.
- The other 4 features align with forward and backward selection.

---

### **Next Steps**
1. **Model Validation**:
   - Train the model using the RFE-selected features and compare performance metrics (e.g., accuracy, AUC-ROC) with those using forward and backward-selected features.
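2. **Feature Interaction**:
   - To test whether `Age` interacts with other features, add explicit interaction terms or use a tree-based estimator, since logistic regression alone cannot reveal interactions. Rerunning RFE with `n_features_to_select=4` would also show whether it agrees with the other two methods at the same subset size.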
3. **Consensus Selection**:
   - Combine insights from all three wrapper methods to finalize the optimal feature set for your model.

In [115]:
# Display results from all methods
print("Forward Selection Features:", list(forward_selected_features))
print("Backward Elimination Features:", list(backward_selected_features))
print("RFE Selected Features:", list(rfe_selected_features))


Forward Selection Features: ['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']
Backward Elimination Features: ['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']
RFE Selected Features: ['Age', 'Cart_Items', 'Discount_Code', 'Device_Type', 'Region']


### **Inference from the Results**

The output displays the selected features from three different wrapper methods: **Forward Selection**, **Backward Elimination**, and **Recursive Feature Elimination (RFE)**. Let’s analyze the results:

---

### **Results**
1. **Forward Selection Features**:
   - `['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']`
2. **Backward Elimination Features**:
   - `['Cart_Items', 'Discount_Code', 'Device_Type', 'Region']`
3. **RFE Selected Features**:
   - `['Age', 'Cart_Items', 'Discount_Code', 'Device_Type', 'Region']`

---

### **Key Observations**
1. **Common Features**:
   - `Cart_Items`, `Discount_Code`, `Device_Type`, and `Region` are consistently selected by **all three methods**.
   - These features are likely the most important and relevant predictors of the target variable (`Purchase`).

2. **Discrepancy in RFE**:
   - RFE includes an additional feature: **`Age`**, which was not selected by forward or backward methods.
   - RFE was asked for five features while the other two methods returned four, so the extra feature reflects the requested subset size rather than a disagreement about the four common features.

3. **Agreement Between Forward and Backward**:
   - Forward Selection and Backward Elimination selected the exact same set of features, which suggests the subset is stable with respect to search direction.

4. **Ranking Insight from RFE**:
   - `Age` is the fifth feature in RFE's coefficient-based ranking; whether it adds predictive value needs to be checked by validation.

---

### **Interpretation of Selected Features**
- **`Cart_Items`**:
  - The number of items in the cart is a plausible predictor of whether a purchase will occur.
- **`Discount_Code`**:
  - The use of a discount code likely influences a customer’s decision to purchase.
- **`Device_Type`**:
  - The type of device (mobile, desktop, etc.) used for shopping may impact purchasing behavior.
- **`Region`**:
  - Geographic location might correlate with purchasing trends or behavior.
- **`Age` (RFE)**:
  - Selected only by RFE. It could play a minor role, but any interaction with features like `Cart_Items` or `Discount_Code` would have to be tested with explicit interaction terms.

---

### **Next Steps**
1. **Model Evaluation**:
   - Train and evaluate models using features selected by each method and compare performance metrics (e.g., accuracy, F1 score, AUC-ROC) to validate the selected feature sets.
2. **Feature Engineering**:
   - Test potential interactions involving `Age` (e.g., `Age * Cart_Items`) by adding them as explicit features and checking whether validation performance improves.
3. **Final Feature Selection**:
   - Consider combining the insights from all three methods to create a robust final feature set:
     - Start with `Cart_Items`, `Discount_Code`, `Device_Type`, and `Region`.
     - Optionally include `Age` based on validation results.
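
Combining the three results can be done with a simple vote count. The sketch below hard-codes the feature lists printed earlier; in the notebook the same logic could run on `forward_selected_features`, `backward_selected_features`, and `rfe_selected_features`. The names `selections`, `core`, and `optional` are illustrative.

```python
from collections import Counter

# Feature lists as printed by the three selectors above
selections = {
    "forward": ["Cart_Items", "Discount_Code", "Device_Type", "Region"],
    "backward": ["Cart_Items", "Discount_Code", "Device_Type", "Region"],
    "rfe": ["Age", "Cart_Items", "Discount_Code", "Device_Type", "Region"],
}

# Count how many methods selected each feature
votes = Counter(f for feats in selections.values() for f in feats)
core = sorted(f for f, v in votes.items() if v == len(selections))
optional = sorted(f for f, v in votes.items() if v < len(selections))

print("Selected by every method:", core)
print("Selected by some methods:", optional)
```

Features in `core` form the starting set, while those in `optional` (here only `Age`) are candidates to add or drop based on validation results.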

---

### **Conclusion**
The agreement between Forward Selection and Backward Elimination supports `Cart_Items`, `Discount_Code`, `Device_Type`, and `Region` as a stable candidate subset. RFE's inclusion of `Age` follows from requesting five features instead of four, so its role should be checked by validation before it is added. The selected features can then be used for training, with performance compared against the full feature set.

### Explanation of Results

- **Forward Selection**: Provides the most critical subset of features, starting from no features and adding only those that improve model performance.

- **Backward Elimination**: Provides a similar subset but starts with all features, removing the least important ones.

- **RFE**: Uses a ranking mechanism to recursively remove the least important features. The final subset might differ slightly due to the recursive nature of the method.

### **Do Wrapper Methods Include Interaction Effects?**

Yes, **wrapper methods** can include **interaction effects**, but this depends on the machine learning model used as the estimator in the wrapper method. Wrapper methods evaluate feature subsets based on the performance of a predictive model, and some models inherently consider interaction effects while others do not.

---

### **Key Points on Interaction Effects in Wrapper Methods**
1. **What Are Interaction Effects?**
   - Interaction effects occur when the relationship between one feature and the target variable depends on the value of another feature.
   - Example: In an e-commerce context, the effect of a `Discount_Code` on purchase likelihood might vary depending on the `Device_Type`.

2. **Wrapper Methods and Interaction Effects**:
   - Wrapper methods rely on the underlying model (e.g., logistic regression, decision trees) to evaluate the performance of feature subsets. The ability to capture interaction effects depends on the model used:
     - **Linear Models (e.g., Logistic Regression)**:
       - Do **not automatically consider interactions** unless you explicitly include interaction terms in the feature set (e.g., `Discount_Code * Device_Type`).
       - You need to engineer interaction features manually if using these models.
     - **Tree-Based Models (e.g., Random Forest, XGBoost)**:
       - Automatically capture interaction effects because they split data hierarchically and consider combinations of feature splits.
       - For instance, a decision tree might split first on `Discount_Code` and then on `Device_Type`, inherently modeling the interaction between the two.

3. **Recursive Feature Elimination (RFE)**:
   - RFE with tree-based models can reflect interaction effects, because a tree's feature importance credits a feature for every split it takes part in, including splits that only help in combination with other features.
   - With linear models, RFE cannot detect interaction effects unless interaction terms are explicitly included in the feature set.

4. **Sequential Feature Selectors (Forward/Backward Selection)**:
   - These methods evaluate subsets of features sequentially, but their ability to include interaction effects depends on the estimator:
     - **Linear Models**: Require manual inclusion of interaction terms.
     - **Tree-Based Models**: Automatically account for interactions during evaluation.
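
The difference between the two model families can be seen on a deliberately extreme synthetic case where the target depends only on an interaction. In the sketch below, `discount` and `is_mobile` are illustrative stand-ins for `Discount_Code` and a one-hot `Device_Type` column, and the XOR rule for the target is an assumption chosen to make the effect obvious.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000
discount = rng.integers(0, 2, size=n)   # stand-in for Discount_Code
is_mobile = rng.integers(0, 2, size=n)  # stand-in for a one-hot Device_Type column
# Pure interaction: the label is 1 only when exactly one of the two is 1 (XOR)
y_demo = discount ^ is_mobile

X_main = np.column_stack([discount, is_mobile])
X_inter = np.column_stack([discount, is_mobile, discount * is_mobile])


def cv_accuracy(estimator, X):
    """Mean 5-fold cross-validated accuracy on the synthetic target."""
    return cross_val_score(estimator, X, y_demo, cv=5, scoring="accuracy").mean()


acc_linear = cv_accuracy(LogisticRegression(max_iter=500), X_main)
acc_linear_inter = cv_accuracy(LogisticRegression(max_iter=500), X_inter)
acc_tree = cv_accuracy(DecisionTreeClassifier(random_state=0), X_main)

print(f"Logistic regression, main effects only: {acc_linear:.3f}")
print(f"Logistic regression, with product term: {acc_linear_inter:.3f}")
print(f"Decision tree, main effects only:       {acc_tree:.3f}")
```

A wrapper method built on the linear model would see little value in either feature here unless the product column is supplied, while one built on the tree would not need it.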

---

### **How to Handle Interaction Effects in Wrapper Methods?**

1. **If Using Linear Models**:
   - Manually create interaction terms as part of the feature engineering process.
   - Example:
     ```python
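     # Assumes Device_Type is already numerically encoded; a raw string column would not give a meaningful product
     df['Discount_Device_Interaction'] = df['Discount_Code'] * df['Device_Type']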
     ```
   - Include these interaction terms in the feature set for wrapper methods.

2. **If Using Tree-Based Models**:
   - Tree-based models like Random Forest and XGBoost inherently handle interaction effects, so no additional steps are needed.

3. **Hybrid Approach**:
   - Combine wrapper methods with feature engineering to test both individual features and interaction effects.
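
When one of the interacting features is categorical, a product of label codes is hard to interpret. A common alternative, sketched below under the assumption that `Device_Type` still holds its string labels, is to one-hot encode it and multiply each dummy column by `Discount_Code`. The small `df_demo` frame and the `Discount_x_` column prefix are illustrative.

```python
import pandas as pd

# Tiny illustrative frame (assumption: mirrors two columns of the dataset)
df_demo = pd.DataFrame({
    "Discount_Code": [0, 1, 1, 0, 1],
    "Device_Type": ["Mobile", "Desktop", "Tablet", "Mobile", "Mobile"],
})

# One 0/1 column per device type
device_dummies = pd.get_dummies(df_demo["Device_Type"], prefix="Device", dtype=int)

# One interaction column per device type: 1 only if a discount was used on that device
interactions = device_dummies.mul(df_demo["Discount_Code"], axis=0).add_prefix("Discount_x_")

features = pd.concat([df_demo[["Discount_Code"]], device_dummies, interactions], axis=1)
print(features)
```

The resulting `features` frame can be passed to a wrapper method in place of the original columns, so that each interaction term competes for selection like any other feature.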

---

### **Example**

#### **With a Linear Model (Logistic Regression)**:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

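# Include interaction terms in the feature set
# (assumes Discount_Code and Device_Type are already numerically encoded)
X = X.copy()  # avoid modifying the original frame in place
X['Discount_Device_Interaction'] = X['Discount_Code'] * X['Device_Type']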

# Apply RFE
model = LogisticRegression(max_iter=500)
rfe_selector = RFE(estimator=model, n_features_to_select=5)
rfe_selector.fit(X, y)

# Selected Features
print("Selected Features:", X.columns[rfe_selector.support_])
```

#### **With a Tree-Based Model (Random Forest)**:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Use Random Forest for RFE (random_state fixed so the selection is reproducible)
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe_selector = RFE(estimator=model, n_features_to_select=5)
rfe_selector.fit(X, y)

# Selected Features
print("Selected Features:", X.columns[rfe_selector.support_])
```

---

### **Conclusion**
- Wrapper methods **can include interaction effects** if the model used as an estimator supports them.
- For **linear models**, interaction terms need to be manually engineered.
- For **tree-based models**, interaction effects are inherently captured, making them more robust for complex feature relationships.
- To fully capture interaction effects, it’s often best to use tree-based models in wrapper methods or explicitly engineer interactions for linear models.