# Problem 3. Steps to Design a Feature Selection Algorithm (Module 3)
40 Points Total<br><br>

In the lecture notes an example of ranking and selecting the top attributes (features) from data is provided. This example will be the starting point for ranking your features of interest. <br>

In this problem you will be designing an intelligent algorithm for feature selection and ranking by identifying the most relevant features from a dataset that contribute most significantly to the outcome of interest.

# Fill before submitting
- **Student:** `Zach Hatzenbeller`
- **Course:** `Data Science Modeling & Analytics`
- **Homework:** `HW #1`
- **Date:** `2025-09-14`
- **Instructor:** `Ben Rodriguez, PhD`

---

## What to Submit in your `.ipynb`
Your notebook must include, in this order:

1. **Cover Block** — Name, course, HW #, date.  
2. **README (Execution & Setup)** — Python version; required packages + install steps; dataset source + download/use instructions; end‑to‑end run steps; any hardware notes (GPU/CPU).  
3. **Adjustable Inputs** — A single, clearly marked code cell where we can change paths, seeds, and key hyperparameters.  
4. **Problem Sections** — Each problem and sub‑part clearly labeled (e.g., “Problem 1 (a)”).  
5. **Results, Summary & Conclusions** — Your takeaways, trade‑offs, limitations.  
6. **References & Attributions** — Cite datasets, code you reused, articles, and **any AI tools** used (and how).

> **One file only**. The notebook must run **top‑to‑bottom** with no errors.

---

## How You’re Graded (what “full credit” looks like)

**1) Completeness & Problem Coverage (20%)**  
<div style="margin-left: 40px"> To earn full points, students must ensure that all parts of the assignment, including sub-questions, are fully answered. Both qualitative and quantitative components should be addressed where required, and any coding tasks must be implemented completely without omissions. </div>

**2) Writing Quality, Technical Accuracy & Justification (20%)**  
<div style="margin-left: 40px"> Writing should be clear, concise, and demonstrate graduate-level quality. All technical content must be correct, and reasoning should be sound and well-supported. Students are expected to justify their design choices and conclusions with logical arguments that reflect a strong understanding of the material. </div>   

**3) Quantitative Work (0% on this HW)**  
<div style="margin-left: 40px"> Assignments should clearly state all assumptions before attempting solutions. Derivations and calculations must be shown step by step, either in Markdown cells or through annotated code. Final results should be presented with appropriate units and precision, ensuring they are easy to interpret and technically correct. </div>
 
**4) Code Quality, Documentation & Execution (30%)**  
<div style="margin-left: 40px"> Code must run from top to bottom without errors, avoiding “Traceback” or other runtime issues. Programs should follow best practices for naming, formatting, and organization, with descriptive variables and functions. Meaningful comments should be included to explain key logic, making the code both efficient and easy to follow. </div>

**5) Examples, Test Cases & Visuals (20%)**  
<div style="margin-left: 40px"> Students should include realistic examples and test cases that demonstrate program functionality, with outputs clearly labeled. Figures and tables must be properly titled, captioned, and have labeled axes. For machine learning tasks, particularly those with imbalanced datasets such as Credit Card Fraud or NSL-KDD, evaluation metrics must go beyond simple accuracy and include measures like precision, recall, F1-score, and ROC or PR curves. </div>

**6) Notebook README & Reproducibility (10%)**  
<div style="margin-left: 40px"> Each notebook must include a README section containing the Python version, a list of required packages with installation instructions, dataset details with download information, and complete steps to run the notebook. The work should be fully reproducible on another system, with seeds set for consistency and relative paths used instead of system-dependent absolute paths. </div>

---

## README (Execution & Setup)

**Use this section to make your notebook reproducible.**

- **Python version:** `3.11.1`
- **Required packages:** `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `kagglehub`
- **Install instructions (if non-standard):**
  ```bash
  pip install numpy pandas scikit-learn matplotlib kagglehub
  ```
- **Datasets used:**
  - The `mosapabdelghany/medical-insurance-cost-dataset` dataset is downloaded directly from Kaggle with `kagglehub`
  - All steps are in order and will clean/transform the dataset if necessary
- **How to run this notebook:**
  1. Run all cells in order (Kernel → Restart & Run All).
  2. Verify that all outputs match those in the **Sample Tests** section.
  3. Ensure figures and tables render correctly.

**README hint**: Place `creditcard.csv` in `./data/` (or set `DATA_PATH` below). The dataset can be downloaded from Kaggle. Due to size limits, keep only relative paths in this notebook.

---

## Adjustable Inputs Example
```python
# === Adjustable Inputs (edit here) ===
DATA_PATH = "path/to/data.csv"     # use relative paths
RANDOM_SEED = 42
# Isolation Forest
IF_N_ESTIMATORS = 200
IF_MAX_SAMPLES = "auto"
IF_CONTAMINATION = 0.01            # will be swept below
# LOF
LOF_N_NEIGHBORS = 20
LOF_CONTAMINATION = 0.01           # will be swept below
# Contamination sweep values (example)
CONTAMINATION_GRID = [0.001, 0.005, 0.01, 0.02, 0.05]
```

## Problem Statement for Problem 3: Steps to Design a Feature Selection Algorithm

Use the following steps for your algorithm:
* Analyze the Data
* Define Selection Criteria
* Choose a Feature Selection Method
* Implement the Selection Algorithm
* Evaluate Feature Importance
* Iterate and Optimize

You will need to implement 2 of the following algorithms:
1) Recursive Feature Elimination (RFE)
   - Uses an external estimator to weigh the importance of features and recursively remove the least important ones.
2) Forward/Backward Selection
   - Iteratively adds/removes features based on model performance.
3) Decision Trees (e.g., Random Forest, Gradient Boosting Trees)
   - These models provide feature importance scores as a by-product of model training.
4) Principal Component Analysis (PCA)
   - For continuous variables, reduces dimensionality while preserving variance.
5) Linear Discriminant Analysis (LDA)
   - Useful for classification problems to find the feature combination that best separates classes.
6) Feature Importance from Neural Networks
   - Using techniques like permutation importance to assess the impact of each feature on neural network predictions.
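
As a point of comparison for option 1, the sketch below runs scikit-learn's built-in `RFE` on synthetic data. The data, the variable names (`X_demo`, `y_demo`, `rfe_demo`), and the choice of two informative features are assumptions made for illustration; this is not the implementation used later in this notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data (illustrative only): 5 features, 2 of them informative.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# RFE refits the estimator and drops the weakest feature until 2 remain.
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe_demo.fit(X_demo, y_demo)

selected_idx = np.flatnonzero(rfe_demo.support_)
print("Selected feature indices:", selected_idx.tolist())
print("Ranking (1 = kept):", rfe_demo.ranking_.tolist())
```

`support_` is a boolean mask over the input columns and `ranking_` assigns 1 to every kept feature, with larger values for features that were eliminated earlier.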

# (a) [2.5 points] Analyze the Data

## Type your analysis of the data here ##

In [105]:
## In your analysis of the data consider the following: ##
# 1) Analyze the nature of the data (numerical, categorical, time-series, etc.).
# 2) Determine the type of problem (classification, regression).
import pandas as pd
import kagglehub
from pathlib import Path

# Download latest version
path = kagglehub.dataset_download("mosapabdelghany/medical-insurance-cost-dataset")
df = pd.read_csv(Path(path, "insurance.csv"))

print("Path to dataset files:", path)
df.head()

Path to dataset files: C:\Users\zhatz\.cache\kagglehub\datasets\mosapabdelghany\medical-insurance-cost-dataset\versions\1


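| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |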
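
The two analysis questions (nature of the data, type of problem) can be checked programmatically. The sketch below rebuilds only the five rows printed above, so it runs without downloading anything; the names `sample`, `TARGET`, `numeric_cols`, and `categorical_cols` are illustrative and not part of the original notebook.

```python
import pandas as pd

# First five rows, copied from the notebook output above.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

TARGET = "charges"  # continuous target, so this is a regression problem
features = sample.drop(columns=TARGET)

# Split the feature columns by dtype: numeric vs. everything else.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print("Numeric features:", numeric_cols)
print("Categorical features:", categorical_cols)
print("Target dtype:", sample[TARGET].dtype)
```

On these rows the split gives three numeric features (`age`, `bmi`, `children`) and three categorical ones (`sex`, `smoker`, `region`), which is why the categorical columns are one-hot encoded in the next step.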


### One hot encode columns that are categorical

In [106]:
encoded = pd.get_dummies(df)*1 # multiply by 1 to get numeric binary for each true false column
encoded.head()

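| | age | bmi | children | charges | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | 16884.924 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 18 | 33.77 | 1 | 1725.5523 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | 33.0 | 3 | 4449.462 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 33 | 22.705 | 0 | 21984.47061 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32 | 28.88 | 0 | 3866.8552 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |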
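
Full one-hot encoding keeps every level of each categorical column, so columns such as `sex_female` and `sex_male` always sum to 1 and are perfectly collinear. A possible alternative, not used in this notebook, is `drop_first=True`. The sketch below compares the two encodings on a small hand-made frame (`toy` is a hypothetical name).

```python
import pandas as pd

# Tiny illustrative frame with two categorical columns.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
})

# Full encoding: one column per category level.
full = pd.get_dummies(toy, dtype=int)
# Reduced encoding: the first level of each categorical column is dropped.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# In the full encoding the two sex columns always sum to 1 (perfect collinearity).
sex_sum = full["sex_female"] + full["sex_male"]
print(sex_sum.tolist())
```

With collinear columns, a linear model can split one effect across two coefficients, which matters when features are ranked by their coefficients.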


# (b) [2.5 points] Define Selection Criteria

## Type your selection criteria here ##

Features will be selected according to their measured contribution to the model. I will implement two feature selection methods. The first is recursive feature elimination (RFE), which repeatedly removes one feature from the dataset until a target number of features remains. I plan to use linear regression to rank the importance of the features based on their regression coefficients.

The second method is principal component analysis (PCA), which reduces the number of dimensions in the feature space. The number of components to keep depends on the variance those components explain. Ideally the retained components explain at least 90% of the variance in the original data. This keeps most of the information needed to build an effective model while reducing the number of features in the dataset.
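
The 90% criterion can be made concrete with a few lines of NumPy. The sketch below uses synthetic data (three hidden signals mixed into six features, an assumption chosen only for illustration) and finds the smallest number of components whose cumulative explained-variance ratio reaches the threshold. The names `X_demo`, `THRESHOLD`, and `n_components` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 underlying signals mixed into 6 features plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 6))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Standardize, then read the variance along each principal axis from the SVD.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
singular_values = np.linalg.svd(X_std, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

THRESHOLD = 0.90
# Smallest k whose cumulative ratio reaches the threshold.
n_components = int(np.argmax(cumulative >= THRESHOLD) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 90%:", n_components)
```

Because the squared singular values of the standardized data are proportional to the variance along each principal axis, their normalized cumulative sum is the quantity the 90% rule is applied to.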

# (c) [2.5 points] Choose a Feature Selection Method

## Type the chosen feature selection method here ##

- Recursive Feature Elimination (RFE): recursively removes features, using linear regression to determine the importance of each feature at each iteration.
- Principal Component Analysis (PCA): reduces the features in the dataset to a number of components that is either specified by the user or chosen from the variance explained. Reducing the dimensionality keeps the directions that capture the most variance in the data.

# (d) [20 points] Implement the Selection Algorithm

## Type your implementation of the selection algorithm here ##

In [107]:
## In implementing the selection algorithm consider the following: ##
# 1) Develop logic to evaluate features based on the chosen method.
# 2) Ensure scalability and efficiency, especially for large datasets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

class RecursiveFeatureSelection:

    def __init__(self, X: pd.DataFrame, y: pd.Series, num_features: int, scaled: bool = False):
        self.X = X
        self.y = y
        self.columns = list(self.X.columns)
        self.num_features = num_features
        self.scaled = scaled
        
    def evaluate(self):
        # Number of elimination rounds (zero when num_features already covers every column)
        len_col = len(self.X.columns)
        iterations = max(len_col - self.num_features, 0)

        # Create the pipeline
        if not self.scaled:
            pipeline = Pipeline([
                ('scaler', StandardScaler()),  # Step 1: Standardize the features
                ('regressor', LinearRegression()) # Step 2: Apply Linear Regression
            ])
        else:
            pipeline = Pipeline([
                ('regressor', LinearRegression()) # Step 2: Apply Linear Regression
            ])

        for _ in range(iterations):
            X = self.X.values
            y = self.y.values
            reg = pipeline.fit(X, y)
            coef = reg["regressor"].coef_
            # argmin on the signed coefficients removes the most negative one,
            # which is not necessarily the one with the smallest magnitude
            min_val = np.argmin(coef)
            remove_col = self.columns[min_val]
            self.X = self.X.loc[:, self.X.columns != remove_col]
            self.columns.remove(remove_col)
        
        X = self.X.values
        y = self.y.values
        reg = pipeline.fit(X, y)
        coef = reg["regressor"].coef_
        
        return self.X, coef

class PCAFeatureSelection:

    def __init__(self, X: pd.DataFrame, features: int = -1, var_explained: float = 0.9, scaled: bool = False):
        self.X = X
        self.features = features
        self.var_explained = var_explained
        self.scaled = scaled

    def evaluate(self):
        num_features = len(self.X.columns)

        if not self.scaled:
            self.X = StandardScaler().fit_transform(self.X)
        else:
            self.X = self.X.values
        
        if self.features > 0:
            self.X = PCA(n_components=self.features).fit_transform(self.X)
        
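        if self.features < 0:
            # Try every component count up to and including the full feature count,
            # so a count is always found before the final PCA fit
            for comp in range(1, num_features + 1):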
                pca_var = PCA(n_components=comp).fit(self.X)
                var_ratio = sum(pca_var.explained_variance_ratio_)
                if var_ratio >= self.var_explained:
                    self.features = comp
                    break

            self.X = PCA(n_components=self.features).fit_transform(self.X)

        return self.X, self.features
        
# Recursive feature selection
X = encoded.loc[:, encoded.columns != "charges"]
y = encoded.loc[:, encoded.columns == "charges"]
rfs = RecursiveFeatureSelection(X, y, num_features=4, scaled=False)
df_features, coef = rfs.evaluate()
display(df_features.head())

# PCA feature selection
X = encoded.loc[:, encoded.columns != "charges"]
y = encoded.loc[:, encoded.columns == "charges"]
pca = PCAFeatureSelection(X, features=4, var_explained=0.9, scaled=False)
df_decomp, num_comps = pca.evaluate()
print(f"PCA Number of Components: {num_comps}")
display(df_decomp)

| | age | bmi | children | smoker_yes |
|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | 1 |
| 1 | 18 | 33.77 | 1 | 0 |
| 2 | 28 | 33.0 | 3 | 0 |
| 3 | 33 | 22.705 | 0 | 0 |
| 4 | 32 | 28.88 | 0 | 0 |


PCA Number of Components: 4


array([[ 0.88105705,  2.8175999 , -0.75317391,  1.75760484],
       [ 0.82113249, -1.30845864,  1.63114951, -0.58722786],
       [ 0.81408993, -1.34196252,  1.61331106, -0.54154408],
       ...,
       [-1.07826824,  0.75390532,  2.09189278, -0.56093002],
       [-1.63890056,  0.45477051, -0.3789003 ,  1.60874176],
       [ 0.75432539,  2.83206643, -1.06794547, -1.29748481]])
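
A common variant of coefficient-based elimination ranks features by the absolute value of their standardized coefficients, so that a strong negative effect counts as important. The sketch below is one way to write that variant with NumPy only. The function `eliminate_by_magnitude` and the synthetic data are hypothetical and are not the class used above.

```python
import numpy as np

def eliminate_by_magnitude(X, y, names, num_features):
    """Drop the feature with the smallest |standardized coefficient| until num_features remain."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    keep = list(range(X.shape[1]))
    while len(keep) > num_features:
        subset = X[:, keep]
        X_std = (subset - subset.mean(axis=0)) / subset.std(axis=0)
        design = np.column_stack([np.ones(len(y)), X_std])    # intercept column
        coef = np.linalg.lstsq(design, y, rcond=None)[0][1:]  # skip the intercept
        weakest = int(np.argmin(np.abs(coef)))                # smallest magnitude
        keep.pop(weakest)
    return [names[i] for i in keep]

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
# Synthetic target: "a" has a positive effect, "b" a large negative effect,
# "c" and "d" have no effect.
y_demo = 2.0 * X_demo[:, 0] - 5.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

kept = eliminate_by_magnitude(X_demo, y_demo, ["a", "b", "c", "d"], num_features=2)
print(kept)
```

In this synthetic example the feature `b` has the most negative coefficient, so ranking by signed value would remove it first, while ranking by magnitude keeps it.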

# (e) [10 points] Evaluate Feature Importance

## Type your evaluation of feature importance here ##

In [108]:
## In evaluating feature importance consider the following: ##
# 1) Assess the impact of selected features on model performance.
# 2) Use techniques like cross-validation to avoid overfitting.
# numpy, LinearRegression, StandardScaler, PCA and Pipeline were imported in the previous cell
from sklearn.model_selection import cross_val_score

# Separate the features from the target
X = encoded.loc[:, encoded.columns != "charges"]
y = encoded.loc[:, encoded.columns == "charges"]

# Before feature selection
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Standardize the features
    ('regressor', LinearRegression()) # Step 2: Apply Linear Regression
])
scores = cross_val_score(estimator=pipeline, X=X, y=y, cv=5, scoring='r2')
print(f"Number of features before features selection: {X.shape[1]}")
print("Pre-feature selection Cross Validation R2 Value:")
for idx, r2 in enumerate(list(scores)):
    print(f"Fold {idx}: {r2:0.03f}")
print("")


rfs = RecursiveFeatureSelection(X, y, num_features=6, scaled=False)
df_features, coef = rfs.evaluate()
pipeline = Pipeline([
    ('regressor', LinearRegression()) # Step 2: Apply Linear Regression
])
scores = cross_val_score(estimator=pipeline, X=df_features, y=y, cv=5, scoring='r2')
print(f"Number of Features post-Recursion: {df_features.shape[1]}")
print("Recursive feature selection Cross Validation R2 Value:")
for idx, r2 in enumerate(list(scores)):
    print(f"Fold {idx}: {r2:0.03f}")
print(f"Important Features: {list(df_features.columns)}")
print("")


pca = PCAFeatureSelection(X, features=-1, var_explained=0.95, scaled=False)
df_decomp, num_comps = pca.evaluate()
pipeline = Pipeline([
    ('regressor', LinearRegression()) # Step 2: Apply Linear Regression
])
scores = cross_val_score(estimator=pipeline, X=df_decomp, y=y, cv=5, scoring='r2')
print(f"Number of Features post-PCA: {num_comps}")
print("PCA feature selection Cross Validation R2 Value:")
for idx, r2 in enumerate(list(scores)):
    print(f"Fold {idx}: {r2:0.03f}")

Number of features before features selection: 11
Pre-feature selection Cross Validation R2 Value:
Fold 0: 0.760
Fold 1: 0.701
Fold 2: 0.778
Fold 3: 0.735
Fold 4: 0.756

Number of Features post-Recursion: 6
Recursive feature selection Cross Validation R2 Value:
Fold 0: 0.763
Fold 1: 0.707
Fold 2: 0.779
Fold 3: 0.733
Fold 4: 0.756
Important Features: ['age', 'bmi', 'children', 'smoker_yes', 'region_northeast', 'region_northwest']

Number of Features post-PCA: 8
PCA feature selection Cross Validation R2 Value:
Fold 0: 0.761
Fold 1: 0.706
Fold 2: 0.778
Fold 3: 0.733
Fold 4: 0.756
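
One way to follow the cross-validation hint more strictly is to put the selection step inside the pipeline, so that scaling, selection, and regression are all refit on each training fold and the held-out fold never influences which features are kept. The sketch below does this with scikit-learn's `RFE` and `PCA` on synthetic data. The data and step names are assumptions, and the scores it prints are not comparable to the results above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only): 8 features, 2 informative.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 8))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 2] + rng.normal(scale=1.0, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Selection happens inside the pipeline, so it is refit on every training fold.
rfe_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", RFE(LinearRegression(), n_features_to_select=4)),
    ("regressor", LinearRegression()),
])
pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # A float n_components keeps enough components to explain 95% of the variance.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("regressor", LinearRegression()),
])

rfe_scores = cross_val_score(rfe_pipeline, X_demo, y_demo, cv=cv, scoring="r2")
pca_scores = cross_val_score(pca_pipeline, X_demo, y_demo, cv=cv, scoring="r2")

print(f"RFE in-pipeline R2: {rfe_scores.mean():.3f} +/- {rfe_scores.std():.3f}")
print(f"PCA in-pipeline R2: {pca_scores.mean():.3f} +/- {pca_scores.std():.3f}")
```

Selecting features on the full dataset before cross-validation, as the cell above does, lets every fold's test rows take part in the selection. Moving the step into the pipeline removes that dependency at the cost of refitting the selector for each fold.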


# (f) [2.5 points] Iterate and Optimize the Algorithm

## Type your description of iterating and optimizing the algorithm here ##

To further optimize each algorithm I would do the following:
- Recursive feature elimination: I would iterate over a range of target feature counts and select the count that performs best in cross-validation.
- PCA feature selection: I would iterate over a range of variance-explained thresholds, take the number of components each threshold requires, and keep the setting that gives the highest cross-validation scores across the folds.
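
The search over feature counts described above could be automated with `GridSearchCV`. The sketch below tunes the number of features kept by scikit-learn's `RFE` on synthetic data. The step name `select`, the grid, and the data are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only): 8 features, 2 informative.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 8))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 2] + rng.normal(scale=1.0, size=300)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", RFE(LinearRegression())),
    ("regressor", LinearRegression()),
])

# Try every feature count from 1 to 8 and score each with 5-fold cross-validation.
param_grid = {"select__n_features_to_select": list(range(1, 9))}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_k = search.best_params_["select__n_features_to_select"]
print("Best number of features:", best_k)
print(f"Best mean CV R2: {search.best_score_:.3f}")
```

The same pattern would apply to PCA by replacing the `select` step with a `PCA` step and searching over its `n_components` parameter.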

In feature selection we ideally want a simpler model that retains or improves accuracy relative to a model using all features. Feature selection generally yields the largest gains when the data has a high number of dimensions. The dataset in this notebook has 11 features after one-hot encoding, which is not considered high. Even so, both methods hold up against the baseline: recursive feature selection with 6 features raises R² slightly in three of the five folds (by at most 0.006), lowers it by 0.002 in one fold, and leaves one unchanged, while PCA with 8 components raises it in two folds, lowers it in one, and leaves two unchanged. This shows that a model can be simplified, and in some folds marginally improved, even when the number of dimensions is not severely high.
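
To compare the three runs on a single number, the per-fold R² values printed in part (e) can be averaged. The sketch below copies those fold scores by hand; the dictionary keys and variable names are illustrative.

```python
import numpy as np

# Per-fold R2 values copied from the cross-validation output in part (e).
fold_scores = {
    "all 11 features": [0.760, 0.701, 0.778, 0.735, 0.756],
    "RFE (6 features)": [0.763, 0.707, 0.779, 0.733, 0.756],
    "PCA (8 components)": [0.761, 0.706, 0.778, 0.733, 0.756],
}

baseline = np.array(fold_scores["all 11 features"])
summary = {}
for name, scores in fold_scores.items():
    scores = np.array(scores)
    # Store the mean, the spread across folds, and the mean change vs. the baseline.
    summary[name] = (scores.mean(), scores.std(), (scores - baseline).mean())
    mean, std, delta = summary[name]
    print(f"{name:>20}: mean R2 = {mean:.4f}, std = {std:.4f}, mean change = {delta:+.4f}")
```

The means come out to 0.7460 for all features, 0.7476 for recursive feature selection, and 0.7468 for PCA. These differences are much smaller than the fold-to-fold spread, so the reduced models are best described as matching the full model with fewer inputs.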
