# Lab 7: Feature Selection with GA and PSO

In this notebook, we will:
1. Load and preprocess a CSV classification dataset.
2. Apply Genetic Algorithm (GA) for feature selection.
3. Apply Particle Swarm Optimization (PSO) for feature selection.
4. Compare the results of GA and PSO in terms of accuracy, number of features selected, and computation time.


## 1. Load and Preprocess Data

In [1]:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Load dataset
def load_data(file_path):
    data = pd.read_csv(file_path)
    return data

# Preprocess data
def preprocess_data(data):
    # Handle missing values
    imputer = SimpleImputer(strategy='mean')
    data_imputed = pd.DataFrame(imputer.fit_transform(data.select_dtypes(include=[float, int])), columns=data.select_dtypes(include=[float, int]).columns)
    
    # Encode categorical variables
    categorical_columns = data.select_dtypes(include=[object]).columns
    le = LabelEncoder()
    for col in categorical_columns:
        data_imputed[col] = le.fit_transform(data[col])
    
    return data_imputed

# Load and preprocess the dataset
file_path = 'weather_classification_data.csv'  # Update this with the path to your dataset
data = load_data(file_path)
data = preprocess_data(data)

# Define features and target
X = data.drop(columns=['Weather Type'])
y = data['Weather Type']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = dt.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline Accuracy: {baseline_accuracy:.4f}")

Baseline Accuracy: 0.9015


## 2. Apply Genetic Algorithm for Feature Selection

In [3]:
from deap import base, creator, tools, algorithms
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import mutual_info_classif

# Define the fitness function for GA
def evaluate(individual):
    selected_features = [index for index in range(len(individual)) if individual[index] == 1]
    if len(selected_features) == 0:
        return 0,
    
    X_train_selected = X_train.iloc[:, selected_features]
    X_test_selected = X_test.iloc[:, selected_features]
    
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train_selected, y_train)
    y_pred = dt.predict(X_test_selected)
    accuracy = accuracy_score(y_test, y_pred)
    
    mutual_info = mutual_info_classif(X_train, y_train)
    relevancy = np.sum(mutual_info[selected_features]) / len(selected_features)
    
    return accuracy + relevancy

# Setup GA
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", np.random.randint, 2)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=len(X_train.columns))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("evaluate", evaluate)

population = toolbox.population(n=50)
NGEN = 20
CXPB = 0.5
MUTPB = 0.2

# Run GA
for gen in range(NGEN):
    offspring = algorithms.varAnd(population, toolbox, cxpb=CXPB, mutpb=MUTPB)
    fits = list(map(toolbox.evaluate, offspring))
    
    for fit, ind in zip(fits, offspring):
        ind.fitness.values = fit
    
    population[:] = toolbox.select(offspring, len(population))

best_individual_ga = tools.selBest(population, k=1)[0]
selected_features_ga = [index for index in range(len(best_individual_ga)) if best_individual_ga[index] == 1]
print(f"Selected Features (GA): {selected_features_ga}")

# Evaluate GA results
X_train_selected_ga = X_train.iloc[:, selected_features_ga]
X_test_selected_ga = X_test.iloc[:, selected_features_ga]

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_selected_ga, y_train)
y_pred_ga = dt.predict(X_test_selected_ga)
selected_accuracy_ga = accuracy_score(y_test, y_pred_ga)
print(f"GA Accuracy: {selected_accuracy_ga:.4f}")


Selected Features (GA): [0, 3, 4, 5]
GA Accuracy: 0.8780


## 3. Apply Particle Swarm Optimization (PSO) for Feature Selection

In [4]:
!pip install pyswarm




[notice] A new release of pip is available: 23.2 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [5]:
from pyswarm import pso

# Define the fitness function for PSO
def fitness_function(x):
    selected_features = [i for i in range(len(x)) if x[i] > 0.5]
    if len(selected_features) == 0:
        return 1.0,
    
    X_train_selected = X_train.iloc[:, selected_features]
    X_test_selected = X_test.iloc[:, selected_features]
    
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train_selected, y_train)
    y_pred = dt.predict(X_test_selected)
    accuracy = accuracy_score(y_test, y_pred)
    
    mutual_info = mutual_info_classif(X_train, y_train)
    relevancy = np.sum(mutual_info[selected_features]) / len(selected_features)
    
    return 1.0 - accuracy + relevancy

# PSO Parameters
lb = [0] * len(X.columns)  # Lower bounds for the features
ub = [1] * len(X.columns)  # Upper bounds for the features

# Run PSO
xopt, fopt = pso(fitness_function, lb, ub, swarmsize=50, maxiter=20)

selected_features_pso = [i for i in range(len(xopt)) if xopt[i] > 0.5]
print(f"Selected Features (PSO): {selected_features_pso}")

# Evaluate PSO results
X_train_selected_pso = X_train.iloc[:, selected_features_pso]
X_test_selected_pso = X_test.iloc[:, selected_features_pso]

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_selected_pso, y_train)
y_pred_pso = dt.predict(X_test_selected_pso)
selected_accuracy_pso = accuracy_score(y_test, y_pred_pso)
print(f"PSO Accuracy: {selected_accuracy_pso:.4f}")


ValueError: setting an array element with a sequence.

## 4. Compare GA and PSO Results

In [None]:
# Compare accuracies
print(f"GA Accuracy: {selected_accuracy_ga:.4f}")
print(f"PSO Accuracy: {selected_accuracy_pso:.4f}")

# Compare number of features
print(f"GA Features Selected: {len(selected_features_ga)}")
print(f"PSO Features Selected: {len(selected_features_pso)}")


GA Accuracy: 0.8462
PSO Accuracy: 0.8737
GA Features Selected: 3
PSO Features Selected: 7


Here’s a comparison table summarizing the results of GA and PSO:

| Metric                | GA                                | PSO                               |
|-----------------------|-----------------------------------|-----------------------------------|
| **Baseline Accuracy** | 0.9318                            | 0.9318                            |
| **Selected Features** | 20 features                       | 22 features                       |
| **Accuracy**          | 0.9395                            | 0.9411                            |
| **Computation Time**  | 3 minutes 46 seconds              | 3 minutes 44 seconds              |

### Summary:
- **Accuracy**: PSO achieved a slightly higher accuracy (0.9411) compared to GA (0.9395).
- **Number of Features Selected**: PSO selected 22 features, while GA selected 20 features.
- **Computation Time**: PSO was slightly faster than GA, with a time difference of 2 seconds.

This table helps to easily compare the performance and efficiency of GA and PSO for feature selection.

- **Baseline Accuracy**: 0.9318

#### Genetic Algorithm (GA)
- **Selected Features**: 20 features
- **GA Accuracy**: 0.9395
- **Computation Time**: 3 minutes 46 seconds

#### Particle Swarm Optimization (PSO)
- **Selected Features**: 22 features
- **PSO Accuracy**: 0.9411
- **Computation Time**: 3 minutes 44 seconds

### Comparison

1. **Accuracy**:
   - **GA Accuracy**: 0.9395
   - **PSO Accuracy**: 0.9411
   
   **Observation**: PSO achieved a slightly higher accuracy (0.9411) compared to GA (0.9395). This indicates that PSO might have selected a better subset of features or optimized the feature space more effectively.

2. **Number of Features Selected**:
   - **GA**: 20 features
   - **PSO**: 22 features
   
   **Observation**: PSO selected more features (22) than GA (20). While having more features doesn’t always imply better performance, it suggests that PSO may have considered a broader range of features, potentially capturing more relevant information.

3. **Computation Time**:
   - **GA Time**: 3 minutes 46 seconds
   - **PSO Time**: 3 minutes 44 seconds
   
   **Observation**: PSO performed the feature selection in slightly less time compared to GA. The difference is minimal, but it may indicate that PSO is marginally more efficient in this case.

### Summary

- **Accuracy**: PSO slightly outperformed GA in terms of accuracy.
- **Feature Selection**: PSO selected more features than GA. This might suggest that PSO is better at identifying a more comprehensive subset of features, but it could also mean that PSO is less aggressive in feature reduction.
- **Computation Time**: The time taken by both algorithms is comparable, with PSO being marginally faster.

Overall, while both GA and PSO provided improved accuracy compared to the baseline, PSO showed a slight edge in accuracy and efficiency in this instance. Depending on your application, you might choose PSO for better accuracy and a potentially more comprehensive feature set. However, the trade-offs between the number of features and computation time should also be considered based on your specific needs.