# Evaluating the CMC Dataset with Feature Selection and Causality Analysis

## Overview
This notebook evaluates the **CMC dataset** in three different experiments. The goal is to combine feature engineering driven by **Large Language Models (LLMs)** with feature selection and **causal inference techniques**, and to measure their impact on model performance.

## Dataset: Contraceptive Method Choice (CMC)
The CMC dataset contains demographic and socio-economic attributes of women and their corresponding **contraceptive method usage**. The target variable (`Contraceptive_method_used`) is a categorical label representing the type of contraceptive used:
- `1`: No contraception
- `2`: Long-term method
- `3`: Short-term method

## Experiments Conducted
We conduct three different experiments:
1. **Feature Selection Only** – Various feature selection methods (e.g., RFE, LassoCV, Mutual Information) are applied to enhance predictive accuracy.
2. **Causal Analysis First, then Feature Selection** – Causal analysis is performed to narrow down the features before applying feature selection techniques.
3. **Feature Selection First, then Causal Analysis** – Standard feature selection methods are applied first, followed by causal filtering.
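
The difference between the three setups is only the order in which the filters run. The sketch below illustrates this with two hypothetical stand-in filters (`toy_feature_selection` and `toy_causal_filter` are invented for this example and are not the methods used in the notebook); it shows that swapping the order can change the surviving feature set.

```python
from typing import Callable, List, Sequence

FeatureFilter = Callable[[Sequence[str]], List[str]]

def run_pipeline(features: Sequence[str], steps: Sequence[FeatureFilter]) -> List[str]:
    """Apply each filter in order, feeding its output to the next one."""
    current = list(features)
    for step in steps:
        current = step(current)
    return current

# Hypothetical stand-ins for the real filters (assumptions for illustration only).
def toy_feature_selection(features: Sequence[str]) -> List[str]:
    # Pretend the selector keeps at most the first three features it sees.
    return list(features)[:3]

def toy_causal_filter(features: Sequence[str]) -> List[str]:
    # Pretend only these features pass the causal check.
    causal = {"Wifes_education", "Media_exposure", "Age_squared"}
    return [f for f in features if f in causal]

all_features = ["Wifes_education", "Husbands_education", "Wifes_religion",
                "Media_exposure", "Age_squared"]

experiment_1 = run_pipeline(all_features, [toy_feature_selection])
experiment_2 = run_pipeline(all_features, [toy_causal_filter, toy_feature_selection])
experiment_3 = run_pipeline(all_features, [toy_feature_selection, toy_causal_filter])

print("Experiment 1:", experiment_1)
print("Experiment 2:", experiment_2)
print("Experiment 3:", experiment_3)
```

With these toy filters, Experiment 2 keeps three features while Experiment 3 keeps only one, because the selector discards two causally relevant features before the causal check ever sees them.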

## Objective
The goal is to analyze how feature selection and causal filtering affect the **model's predictive accuracy** and whether causality-based feature selection enhances generalization.

---


### Original CAAFE
Running the original code published with the paper "Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering" serves as a baseline for comparison. The main idea is to instruct GPT to either generate new features or remove existing ones, and to check in each iteration whether the LLM's suggestions improve accuracy. If they do, the features are added to the DataFrame; otherwise, the suggestions are discarded. Since I do not have access to the GPT API, I used the free Gemini API instead.
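
The accept-or-reject loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not CAAFE's implementation: candidate features come from a fixed hand-written list instead of an LLM, only additions are considered (no removals), and `LogisticRegression` with 3-fold cross-validation stands in for the real evaluation. The names `cv_accuracy` and `greedy_feature_engineering` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_accuracy(df, y):
    """Mean 3-fold cross-validated accuracy of a simple classifier."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, df, y, cv=3, scoring="accuracy").mean()

def greedy_feature_engineering(df, y, candidates):
    """Keep a candidate feature only if it raises the validation accuracy."""
    best_df, best_score = df.copy(), cv_accuracy(df, y)
    accepted = []
    for name, make_feature in candidates:
        trial = best_df.copy()
        trial[name] = make_feature(trial)
        score = cv_accuracy(trial, y)
        if score > best_score:
            # The suggestion helped: keep it and raise the bar for the next one.
            best_df, best_score = trial, score
            accepted.append(name)
        # Otherwise the suggestion is discarded and best_df stays unchanged.
    return best_df, best_score, accepted

# Synthetic data: the label depends on the product of the two columns,
# which a linear model cannot capture from the raw columns alone.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=300), "b": rng.normal(size=300)})
y = (df["a"] * df["b"] > 0).astype(int)

# Hand-written candidates standing in for LLM suggestions (assumption).
candidates = [
    ("a_times_b", lambda d: d["a"] * d["b"]),
    ("noise", lambda d: rng.normal(size=len(d))),
]

baseline = cv_accuracy(df, y)
final_df, final_score, accepted = greedy_feature_engineering(df, y, candidates)
print(f"baseline={baseline:.3f} final={final_score:.3f} accepted={accepted}")
```

Because a feature is kept only when it strictly improves the score, the final accuracy on the validation folds can never fall below the baseline; whether that gain carries over to a held-out test set is a separate question.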

The cell below reports the accuracy of the `TabPFNClassifier` on the dataset at index 2 (the CMC dataset described above) before and after applying CAAFE. This comparison shows the impact of the original CAAFE approach on model performance.

In [1]:
# Import necessary libraries
from caafe import CAAFEClassifier  # Automated Feature Engineering for tabular datasets
from tabpfn import TabPFNClassifier  # Fast Automated Machine Learning for small tabular datasets
from sklearn.ensemble import RandomForestClassifier  # Alternative classifier (not used in this script)
import torch  # To check if GPU (CUDA) is available for faster computation
from caafe import data  # Utility functions for data loading and preprocessing
from sklearn.metrics import accuracy_score  # To evaluate model performance
from functools import partial  # Used to modify function behavior
import warnings
warnings.filterwarnings("ignore")  # Suppress unnecessary warnings
import logging

# Suppress logs from the DoWhy library to avoid unnecessary output
logging.getLogger("dowhy").setLevel(logging.ERROR)

# Load all available datasets for testing CAAFE
cc_test_datasets_multiclass = data.load_all_data()

# Select a specific dataset (index 2/4/6/8)
dataset_index = 2
ds = cc_test_datasets_multiclass[dataset_index]

# Split dataset into training and testing sets
ds, df_train, df_test, _, _ = data.get_data_split(ds, seed=0)

# Extract the target column name and dataset description
target_column_name = ds[4][-1]
dataset_description = ds[-1]

# Display dataset name (useful for debugging); the name is the first element of the dataset tuple
ds[0]

# Convert dataset features into numeric format (CAAFE requires numerical features)
from caafe.preprocessing import make_datasets_numeric
df_train, df_test = make_datasets_numeric(df_train, df_test, target_column_name)

# Extract features (X) and labels (y) from training and test sets
train_x, train_y = data.get_X_y(df_train, target_column_name)
test_x, test_y = data.get_X_y(df_test, target_column_name)

### Setup Base Classifier (Before Feature Engineering)

# Uncomment to use RandomForestClassifier instead of TabPFNClassifier
# clf_no_feat_eng = RandomForestClassifier()

# Initialize TabPFNClassifier, using GPU if available for better performance
clf_no_feat_eng = TabPFNClassifier(device=('cuda' if torch.cuda.is_available() else 'cpu'))

# Wrap fit() with partial, following the original CAAFE example; no arguments are bound here, so its behavior is unchanged
clf_no_feat_eng.fit = partial(clf_no_feat_eng.fit)

# Train the classifier on the original dataset (before feature engineering)
clf_no_feat_eng.fit(train_x, train_y)

# Make predictions on the test set
pred = clf_no_feat_eng.predict(test_x)

# Calculate accuracy before applying CAAFE
acc = accuracy_score(pred, test_y)
print(f'Accuracy before CAAFE: {acc}')

### Setup and Run CAAFE (Feature Engineering) - Uses OpenAI API

# Initialize the CAAFEClassifier with the base classifier (TabPFNClassifier)
# - Uses GPT-4 as the LLM model
# - Runs for 10 iterations to generate new features
caafe_clf = CAAFEClassifier(base_classifier=clf_no_feat_eng,
                            llm_model="gpt-4",
                            iterations=10)

# Define different experiment types for feature engineering and selection
experiments = ["Feature_Selection", "Causality_Feature_Selection", "Feature_Selection_Causality", "Base"]

# Define different causal analysis methods
causal_methods = ["individual", "dag", "psm"]

# Specify filenames for saving model attributes and dataset snapshots
model_attributes_file = f"models/model_attributes_{dataset_index}.pkl"
df_train_file = f"df/df_train_{dataset_index}.parquet"

# Display the baseline features before CAAFE modifies them
print(f"Baseline Features: {df_train.columns.values}")
# Apply CAAFE to generate new features and enhance the dataset
caafe_clf.fit_pandas(df_train, df_test, causal_methods[0], experiments[3], model_attributes_file, df_train_file,
                     target_column_name=target_column_name,
                     dataset_description=dataset_description)

# Make predictions using the enhanced dataset after CAAFE
pred = caafe_clf.predict(df_test)

# Calculate accuracy after applying CAAFE
acc = accuracy_score(pred, test_y)
print(f'Accuracy after CAAFE: {acc}')




Number of datasets: 10
Loading balance-scale 11 ..
Loading breast-w 15 ..
Loading cmc 23 ..
Loading credit-g 31 ..
Loading diabetes 37 ..
Loading tic-tac-toe 50 ..
Loading eucalyptus 188 ..
Loading pc1 1068 ..
Loading airlines 1169 ..
Loading jungle_chess_2pcs_raw_endgame_complete 41027 ..
health-insurance-lead-prediction-raw-data at datasets_kaggle/health-insurance-lead-prediction-raw-data/Health Insurance Lead Prediction Raw Data.csv not found, skipping...
pharyngitis at datasets_kaggle/pharyngitis/pharyngitis.csv not found, skipping...
spaceship-titanic at datasets_kaggle/spaceship-titanic/train.csv not found, skipping...
playground-series-s3e12 at datasets_kaggle/playground-series-s3e12/train.csv not found, skipping...
Downsampling balance-scale to 20.0% of samples
Downsampling breast-w to 10.0% of samples
Downsampling tic-tac-toe to 10.0% of samples
Accuracy before CAAFE: 0.5962059620596206
Baseline Features: ['Wifes_age' 'Wifes_education' 'Husbands_education'
 'Number_of_children

### Experiment 1: Evaluating Feature Selection

In this experiment, we apply various feature selection methods to evaluate their impact on model accuracy. The feature selection methods used include:

- Recursive Feature Elimination (RFE)
- LassoCV
- ElasticNetCV
- Mutual Information
- Boruta
- Gradient Boosting
- Decision Tree
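
As a rough illustration of how three of these selectors can be applied with scikit-learn, the sketch below runs RFE, LassoCV (via `SelectFromModel`), and mutual information (via `SelectKBest`) on synthetic data. The data, the number of features to keep, and the choice of `LogisticRegression` inside RFE are assumptions made for this example; the notebook's own configuration is not shown in the source.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LassoCV, LogisticRegression

# Synthetic stand-in for the engineered feature table (assumption).
X_arr, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selectors = {
    # Recursively drops the weakest feature until three remain.
    "RFE": RFE(LogisticRegression(max_iter=1000), n_features_to_select=3),
    # Keeps features whose Lasso coefficient is not shrunk to (near) zero.
    # Note: this treats the class label as a numeric regression target.
    "LassoCV": SelectFromModel(LassoCV(cv=3, random_state=0)),
    # Keeps the three features with the highest estimated mutual information.
    "Mutual Information": SelectKBest(partial(mutual_info_classif, random_state=0), k=3),
}

selected = {}
for name, selector in selectors.items():
    selector.fit(X, y)
    selected[name] = list(X.columns[selector.get_support()])
    print(f"Selected features ({name}): {selected[name]}")
```

RFE and `SelectKBest` return a fixed number of features, whereas the LassoCV-based selector decides the count itself, which is one reason the methods can disagree on the same table.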

#### Generated Features (All)

This represents all features suggested by the Large Language Model (LLM), aggregated over five iterations without any removals. For each feature selection method, we identify and retain only the relevant features, discarding the rest. The effectiveness of each method is gauged by the accuracy improvement of the `TabPFNClassifier` using the selected features.
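
The per-method evaluation can be sketched as a small helper that fits a classifier on a chosen subset of columns and reports test accuracy. In this sketch `RandomForestClassifier` is used as a stand-in for `TabPFNClassifier` so that the example runs without extra dependencies, and the data and the feature subsets are synthetic assumptions; `accuracy_with_features` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_with_features(train_x, train_y, test_x, test_y, features):
    """Fit on the chosen columns only and return accuracy on the test split."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)  # stand-in for TabPFNClassifier
    clf.fit(train_x[features], train_y)
    return accuracy_score(test_y, clf.predict(test_x[features]))

# Synthetic table standing in for the engineered features (assumption).
X_arr, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                               n_redundant=0, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

# Each entry plays the role of one selector's output.
subsets = {"all": list(X.columns), "first_three": ["f0", "f1", "f2"]}
results = {name: accuracy_with_features(train_x, train_y, test_x, test_y, cols)
           for name, cols in subsets.items()}

for name, acc in results.items():
    print(f"Accuracy ({name}): {acc:.3f}")
```

Comparing every subset against the same train/test split keeps the accuracy numbers comparable across selection methods.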

#### Dataset and Method Variability

- **Dataset Variability**: By modifying the `dataset_index` (values: 2, 4, 6, 8), the experiment can be conducted on different datasets, enabling a broader evaluation of feature selection impacts. The datasets corresponding to these indices are:
  - **2**: CMC
  - **4**: Diabetes
  - **6**: Eucalyptus
  - **8**: Airlines

- **Method Variability**: Changing the index into the `experiments` list (referred to here as `experiment_index`) selects a different combination of feature selection and causality approaches:
  - **0**: Feature_Selection
  - **1**: Causality_Feature_Selection
  - **2**: Feature_Selection_Causality
  - **3**: Base (Original CAAFE without additional feature selection or causality adjustments)
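
The two axes of variability above form a small grid of runs. The sketch below only enumerates that grid; it does not call CAAFE, and the `runs` structure is a hypothetical convenience for bookkeeping.

```python
from itertools import product

# Index-to-name mappings as listed above.
datasets = {2: "CMC", 4: "Diabetes", 6: "Eucalyptus", 8: "Airlines"}
experiments = ["Feature_Selection", "Causality_Feature_Selection",
               "Feature_Selection_Causality", "Base"]

# One entry per (dataset, experiment) combination.
runs = [
    {"dataset_index": idx, "dataset": name,
     "experiment_index": e, "experiment": experiments[e]}
    for (idx, name), e in product(datasets.items(), range(len(experiments)))
]

print(len(runs))  # 4 datasets x 4 experiments = 16
print(runs[0])
```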

The primary objective is to determine if feature selection improves the overall performance of CAAFE and to identify which method contributes most significantly to accuracy enhancement.

In [3]:
caafe_clf.fit_pandas(df_train, df_test, causal_methods[0], experiments[0], model_attributes_file, df_train_file,
                     target_column_name=target_column_name,
                     dataset_description=dataset_description)

Loading pre-processed dataset and attributes from files...
Generated Features (All): ['Wifes_education', 'Husbands_education', 'Wifes_religion', 'Wifes_now_working%3F', 'Husbands_occupation', 'Standard-of-living_index', 'Media_exposure', 'Age_Children_Interaction', 'Age_squared', 'Education_diff', 'Children_age_interaction', 'Religion_working', 'Husbands_occupation_education']

Applying feature selection using RFE...
Selected features (RFE): ['Wifes_education', 'Husbands_education', 'Husbands_occupation', 'Standard-of-living_index', 'Age_Children_Interaction', 'Age_squared', 'Education_diff', 'Children_age_interaction', 'Religion_working', 'Husbands_occupation_education']
TabPFNClassifier Accuracy (RFE): 0.5880758807588076

Applying feature selection using LassoCV...
Selected features (LassoCV): ['Wifes_education', 'Husbands_education', 'Wifes_religion', 'Husbands_occupation', 'Standard-of-living_index', 'Media_exposure', 'Age_Children_Interaction', 'Age_squared', 'Children_age_interac

### Experiment 2: Evaluating Causal and then Feature Selection

In this experiment, we first perform a causality analysis to narrow down the features from the **Generated Features (All)**. This step helps in identifying which features have a potential causal relationship with the target variable.

#### Causal-Selected Features

The features that remain after the causality analysis, termed **Causal-Selected Features**, are those suspected of influencing the target variable causally. They are then passed to the feature selection methods to refine the feature set further.
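
To give a feel for what a causal filter might do, the sketch below keeps a feature only if its regression-adjusted effect on the target is larger than the effects obtained after shuffling that feature (a placebo-style check). This is an illustrative stand-in written with NumPy and scikit-learn, not the DoWhy-based analysis the notebook runs; it assumes linear effects and no unobserved confounders, and `adjusted_effect` and `placebo_filter` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_effect(X, y, j):
    """Coefficient of column j in a linear model that also adjusts for all other columns."""
    return LinearRegression().fit(X, y).coef_[j]

def placebo_filter(X, y, names, n_placebo=50, seed=0):
    """Keep features whose adjusted effect exceeds every placebo (shuffled) effect."""
    rng = np.random.default_rng(seed)
    kept = []
    for j, name in enumerate(names):
        real = abs(adjusted_effect(X, y, j))
        placebo = []
        for _ in range(n_placebo):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break any link between feature j and y
            placebo.append(abs(adjusted_effect(Xp, y, j)))
        if real > max(placebo):
            kept.append(name)
    return kept

# Synthetic data: only the first column drives the target.
rng = np.random.default_rng(1)
n = 500
cause = rng.normal(size=n)
unrelated = rng.normal(size=n)
y = 2.0 * cause + rng.normal(scale=0.5, size=n)
X = np.column_stack([cause, unrelated])

print(placebo_filter(X, y, ["cause", "unrelated"]))
```

A feature with no real effect can still pass by chance (roughly once in `n_placebo + 1` attempts), so a filter like this tends to shrink the feature set sharply without being exact.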

#### Applying Feature Selection Methods

The refined feature set undergoes various feature selection techniques to ascertain the most significant predictors. The feature selection methods applied include:

- Recursive Feature Elimination (RFE)
- LassoCV
- ElasticNetCV
- Mutual Information
- Boruta
- Gradient Boosting
- Decision Tree

For each method, we evaluate the accuracy of the `TabPFNClassifier` using the selected features and print the results. This process aims to improve model performance by systematically reducing the feature space, first by causality and then by feature importance.

#### Note on Potential Errors

This section may raise errors, particularly for `dataset_index` values of **4** (Diabetes) and **8** (Airlines). If the causality analysis leaves only one feature, the subsequent feature selection methods cannot be applied, since they typically need more than one feature to choose from.
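
One way to avoid this failure is to skip the selection step when too few features survive. The sketch below shows such a guard around RFE; `safe_rfe` is a hypothetical helper and is not part of the notebook or of CAAFE.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def safe_rfe(X, y, n_features_to_select=2):
    """Run RFE only when there is more than one feature left to choose from."""
    if X.shape[1] < 2:
        # Nothing to eliminate: return the surviving columns unchanged.
        return list(X.columns)
    k = min(n_features_to_select, X.shape[1])
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
    return list(X.columns[selector.get_support()])

# A table with a single surviving feature, as can happen after causal filtering.
single = pd.DataFrame({"only_feature": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]})
labels = [0, 1, 0, 1, 0, 1]
print(safe_rfe(single, labels))  # ['only_feature']
```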

In [2]:
caafe_clf.fit_pandas(df_train, df_test, causal_methods[0], experiments[1], model_attributes_file, df_train_file,
                     target_column_name=target_column_name,
                     dataset_description=dataset_description)

Loading pre-processed dataset and attributes from files...
Generated Features (All): ['Wifes_education', 'Husbands_education', 'Wifes_religion', 'Wifes_now_working%3F', 'Husbands_occupation', 'Standard-of-living_index', 'Media_exposure', 'Age_Children_Interaction', 'Age_squared', 'Education_diff', 'Children_age_interaction', 'Religion_working', 'Husbands_occupation_education']
Performing causal analysis...
Causal-Selected Features: ['Wifes_education', 'Husbands_education', 'Media_exposure']
TabPFNClassifier Accuracy (causality): 0.43902439024390244

Applying feature selection using RFE...
Selected features (RFE): ['Wifes_education', 'Husbands_education', 'Media_exposure']
TabPFNClassifier Accuracy (RFE): 0.43902439024390244

Applying feature selection using LassoCV...
Selected features (LassoCV): ['Wifes_education', 'Husbands_education', 'Media_exposure']
TabPFNClassifier Accuracy (LassoCV): 0.43902439024390244

Applying feature selection using ElasticNetCV...
Selected features (Elasti

### Experiment 3: Evaluating Feature Selection and then Causal

This experiment mirrors Experiment 2, but reverses the order of the operations applied. Here, we first apply various feature selection methods to the **Generated Features (All)** before conducting a causality analysis. The aim is to evaluate how pre-selecting features based on their statistical importance influences the effectiveness of subsequent causal analysis.

#### Applying Feature Selection Methods First

Initially, we apply the following feature selection methods to narrow down the most impactful features:
- Recursive Feature Elimination (RFE)
- LassoCV
- ElasticNetCV
- Mutual Information
- Boruta
- Gradient Boosting
- Decision Tree

Each method helps in refining the feature set by identifying the most statistically significant predictors.

#### Causal Analysis on Selected Features

After feature selection, the reduced feature set undergoes a causality analysis to assess which of the selected features have a potential causal relationship with the target variable.

#### Printing the Accuracy

After conducting both feature selection and causality analysis, we evaluate and print the accuracy of the `TabPFNClassifier` using the causally relevant features. This approach allows us to assess the combined impact of feature selection and causal analysis on model performance.

In [5]:
caafe_clf.fit_pandas(df_train, df_test, causal_methods[0], experiments[2], model_attributes_file, df_train_file,
                     target_column_name=target_column_name,
                     dataset_description=dataset_description)

Loading pre-processed dataset and attributes from files...
Generated Features (All): ['Wifes_education', 'Husbands_education', 'Wifes_religion', 'Wifes_now_working%3F', 'Husbands_occupation', 'Standard-of-living_index', 'Media_exposure', 'Age_Children_Interaction', 'Age_squared', 'Education_diff', 'Children_age_interaction', 'Religion_working', 'Husbands_occupation_education']

Applying feature selection using RFE...
Selected features (RFE): ['Wifes_education', 'Husbands_education']
TabPFNClassifier Accuracy (RFE): 0.43902439024390244

Applying feature selection using LassoCV...
Selected features (LassoCV): ['Wifes_education', 'Husbands_education', 'Media_exposure']
TabPFNClassifier Accuracy (LassoCV): 0.43902439024390244

Applying feature selection using ElasticNetCV...
Selected features (ElasticNetCV): ['Wifes_education', 'Husbands_education', 'Media_exposure']
TabPFNClassifier Accuracy (ElasticNetCV): 0.43902439024390244

Applying feature selection using Mutual Information...
Select