# Feature selection

In [17]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
print('X', X.shape)
print('y', y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=42)
model_overfit = LogisticRegression(max_iter=10000)
model_overfit.fit(X_train, y_train)

y_pred = model_overfit.predict(X_test)
accuracy_overfit = accuracy_score(y_pred, y_test)
print("Test accuracy: ", accuracy_overfit)

X (569, 30)
y (569,)
Test accuracy:  0.956140350877193


In [18]:
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest: keep the k features with the highest scores under a given criterion
# f_classif: ANOVA F-test - scores how strongly each feature's mean differs between classes

k_best = SelectKBest(f_classif, k=10)
X_train_selected = k_best.fit_transform(X_train, y_train)
X_test_selected = k_best.transform(X_test)

model_selected = LogisticRegression(max_iter=10000)
model_selected.fit(X_train_selected, y_train)

y_pred = model_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_pred, y_test)
print("Test accuracy with selected features:", accuracy_selected)

Test accuracy with selected features: 0.9912280701754386
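
To see which columns `SelectKBest` kept, one option is its `get_support()` method. This is a minimal sketch that assumes the same split as above; the names `selector`, `mask` and `kept_names` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Fit the selector on the training split only
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)

# Boolean mask with one entry per original feature (True = kept)
mask = selector.get_support()
kept_names = np.array(data.feature_names)[mask]
print(kept_names)
```

The mask is in the original column order, so the names are not sorted by score; `selector.scores_` holds the per-feature F-values if a ranking is needed.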


## Using Chi-square

![Chi-Square](https://cdn1.byjus.com/wp-content/uploads/2020/10/Chi-Square-Test.png)

The full theory is long and easy to find online. For the code, the key point is this: the larger the chi-square statistic, the stronger the evidence of an association between the feature and the output. The worked example below uses a small table of patients.



| Patient ID | Age   | Heart Disease |
|-------------|--------|---------------|
| 1 | Young | Yes |
| 2 | Old | No |
| 3 | Middle | Yes |
| 4 | Old | No |
| 5 | Young | Yes |
| 6 | Old | No |
| 7 | Middle | Yes |
| 8 | Middle | No |

We summarize the counts in a contingency table:

| Age     | Heart Disease = Yes | Heart Disease = No | Total |
|----------|---------------------|--------------------|--------|
| Young    | 2 | 0 | 2 |
| Middle   | 2 | 1 | 3 |
| Old      | 0 | 3 | 3 |
| **Total**| 4 | 4 | 8 |



For each cell, the expected frequency is calculated as:

$$
E_{ij} = \frac{(\text{Row Total})_i \times (\text{Column Total})_j}{\text{Grand Total}}
$$

| Age     | Yes (expected) | No (expected) |
|----------|----------------|---------------|
| Young    | (2×4)/8 = 1.0  | (2×4)/8 = 1.0 |
| Middle   | (3×4)/8 = 1.5  | (3×4)/8 = 1.5 |
| Old      | (3×4)/8 = 1.5  | (3×4)/8 = 1.5 |


The Chi-square statistic is defined as:

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

| Age     | Yes | No | Chi-square contribution |
|----------|------|----|------------------------|
| Young    | (2−1)²/1 = 1.0 | (0−1)²/1 = 1.0 | 2.0 |
| Middle   | (2−1.5)²/1.5 = 0.1667 | (1−1.5)²/1.5 = 0.1667 | 0.3334 |
| Old      | (0−1.5)²/1.5 = 1.5 | (3−1.5)²/1.5 = 1.5 | 3.0 |
| **Total** |   |   | **χ² = 2.0 + 0.3334 + 3.0 = 5.3334** |


The degrees of freedom (df) are:

$$
df = (r - 1)(c - 1)
$$

where $r$ = number of rows and $c$ = number of columns.

Here:

$$
df = (3 - 1)(2 - 1) = 2
$$


- **Null hypothesis ($H_0$)**: Age and Heart Disease are independent  
- **Alternative hypothesis ($H_1$)**: Age and Heart Disease are dependent  

Using a significance level of $\alpha = 0.05$,  
the critical Chi-square value for $df = 2$ is approximately **5.991**.


Since:

$$
\chi^2_{observed} = 5.33 < 5.991 = \chi^2_{critical}
$$

We **fail to reject $H_0$**.


**Conclusion**

At the 5% significance level, there is **no statistically significant association** between **Age** and **Heart Disease**.

In other words, this sample does not give enough evidence to conclude that the variables are dependent. That is weaker than showing they are **independent**.
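
The hand calculation above can be checked with `scipy.stats.chi2_contingency`. This is a small sketch; the array `observed` is the contingency table from the example, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Rows: Young, Middle, Old. Columns: Heart Disease = Yes, No.
observed = np.array([[2, 0],
                     [2, 1],
                     [0, 3]])

stat, p_value, dof, expected = chi2_contingency(observed)

# Critical value at the 5% significance level
critical = chi2.ppf(0.95, dof)

print("chi2 statistic:", round(stat, 4))   # about 5.3333
print("degrees of freedom:", dof)          # 2
print("critical value:", round(critical, 3))  # about 5.991
print("reject H0:", stat > critical)       # False
```

With counts this small, the chi-square approximation is rough, so the example is best read as an illustration of the arithmetic.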


In [19]:
from scipy.stats import chi2_contingency
import numpy as np

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=42)

chi2_values = []

# Compute a chi-square score for each feature.
# NOTE: chi2_contingency expects a contingency table of counts. Here the
# (feature value, label) pairs are passed as a 569 x 2 table, so the score
# is a heuristic rather than a standard test of independence.
# The scores also use the full X and y, not just the training split.
for feature_idx in range(X.shape[1]):
    observed_values = np.column_stack((X[:, feature_idx], y))
    chi2, _, _, _ = chi2_contingency(observed_values)
    chi2_values.append(chi2)

# Sort feature indices by score, largest chi2 first
sorted_feature_index = np.argsort(chi2_values)[::-1]

# Keep the top-scoring features
num_selected_features = 10
selected_feature_index = sorted_feature_index[:num_selected_features]

selected_feature_names = np.array(data.feature_names)[selected_feature_index]
print("Features have been selected:", selected_feature_names)


# The rest just fits the model on the selected features
X_train_selected = X_train[:, selected_feature_index]
X_test_selected = X_test[:, selected_feature_index]

model_selected = LogisticRegression(max_iter=10000)
model_selected.fit(X_train_selected, y_train)

y_pred = model_selected.predict(X_test_selected)
selected_accuracy = accuracy_score(y_pred, y_test)

print("Accuracy with features selected by chi-square:", selected_accuracy)


Features have been selected: ['area error' 'worst area' 'mean area' 'perimeter error' 'worst perimeter'
 'worst radius' 'mean perimeter' 'mean radius' 'worst texture'
 'worst concavity']
Accuracy with features selected by chi-square: 0.9649122807017544
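
scikit-learn also ships a `chi2` scorer that plugs into `SelectKBest`. It computes the statistic differently from the loop above (it treats each non-negative feature as a frequency summed per class), so the selected features and the accuracy may differ. This is a sketch under that assumption, fitted on the training split only; names such as `selector` and `accuracy_chi2` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# chi2 requires non-negative features, which holds for this dataset
selector = SelectKBest(chi2, k=10)
X_train_chi2 = selector.fit_transform(X_train, y_train)
X_test_chi2 = selector.transform(X_test)

model = LogisticRegression(max_iter=10000)
model.fit(X_train_chi2, y_train)

accuracy_chi2 = accuracy_score(y_test, model.predict(X_test_chi2))
print("Accuracy with SelectKBest(chi2):", accuracy_chi2)
```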


## Using F-score (F-value)

This is the same criterion as `f_classif` from sklearn, which we already used with `SelectKBest` above. Here we compute the score ourselves with `scipy.stats.f_oneway` and get the same test accuracy.

- In the code, **F-score is used to rank features**.
- Higher F → more important feature for classification.
- If you want to know more about it, then read the part below **(optional)**

### F-score theory

- F-test checks:
> "Do the means of a feature differ significantly across different classes?"

- If **class means are very different**, the feature is likely **important**.
- If **class means are similar**, the feature is less informative.


Suppose we have:

- $K$ classes
- $N$ total samples
- Class $k$ has $n_k$ samples
- $x_{ki}$ is the value of feature $x$ for sample $i$ in class $k$
- $\bar{x}_k$ = mean of class $k$
- $\bar{x}$ = overall mean of the feature


$F = \frac{MS_{between}}{MS_{within}}$

Where:
- **Between-class mean square (MS_between)**
$MS_{between} = \frac{1}{K-1} \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})^2$

- **Within-class mean square (MS_within)**
$MS_{within} = \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)^2$


- $(\bar{x}_k - \bar{x})^2$ → how far the class mean is from the overall mean
- $(x_{ki} - \bar{x}_k)^2$ → how far a sample is from its own class mean (spread within the class)
- **High F-score** → feature discriminates well between classes
- **Low F-score** → feature is less informative

---


### Simple Example

Suppose we have **1 feature** and **2 classes**:

| Class | Feature values $x$ |
|-------|---------------------|
| 0     | [1, 2, 3]           |
| 1     | [7, 8, 9]           |

**Step 1: Compute class means and overall mean**

$\bar{x}_0 = 2, \quad \bar{x}_1 = 8, \quad \bar{x} = 5$

**Step 2: Between-class sum of squares (SSB)**

$SS_B = n_0(\bar{x}_0 - \bar{x})^2 + n_1(\bar{x}_1 - \bar{x})^2 = 3(2-5)^2 + 3(8-5)^2 = 54$

**Step 3: Within-class sum of squares (SSW)**

$SS_W = \sum (x_{ki} - \bar{x}_k)^2 = 4$

**Step 4: Mean squares**

$MS_B = SS_B / (K-1) = 54 / 1 = 54$

$MS_W = SS_W / (N-K) = 4 / (6-2) = 1$

**Step 5: F-score**

$F = MS_B / MS_W = 54 / 1 = 54$

The F-score is large, so this feature strongly discriminates between the two classes.
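
The five steps above can be reproduced in a few lines and compared with `scipy.stats.f_oneway`. This is a sketch on the toy data from the table; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

class0 = np.array([1.0, 2.0, 3.0])
class1 = np.array([7.0, 8.0, 9.0])
groups = [class0, class1]

all_values = np.concatenate(groups)
overall_mean = all_values.mean()

# Between-class and within-class sums of squares
ss_between = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: divide by K - 1 and N - K
ms_between = ss_between / (len(groups) - 1)
ms_within = ss_within / (len(all_values) - len(groups))

f_manual = ms_between / ms_within
f_scipy, _ = f_oneway(class0, class1)

print(ss_between, ss_within)  # 54.0 4.0
print(f_manual)               # 54.0
```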



In [20]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from scipy.stats import f_oneway

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=42)

# Split the data by class
X_class0 = X[y==0]
X_class1 = X[y==1]

f_scores = []
for feature_idx in range(X.shape[1]):
    f_score, _ = f_oneway(X_class0[:, feature_idx], X_class1[:, feature_idx])
    f_scores.append(f_score)

# Rank and select features the same way as in the chi-square example
sorted_feature_index = np.argsort(f_scores)[::-1]

num_selected_features = 10
selected_feature_index  = sorted_feature_index[:num_selected_features]

selected_feature_names = np.array(data.feature_names)[selected_feature_index]
print("Features selected by f-score:", selected_feature_names)

# Keep only the selected columns, then fit and evaluate as before
X_train_selected = X_train[:, selected_feature_index]
X_test_selected = X_test[:, selected_feature_index]

model_selected = LogisticRegression(max_iter=10000)
model_selected.fit(X_train_selected, y_train)

y_pred = model_selected.predict(X_test_selected)
selected_accuracy = accuracy_score(y_pred, y_test)

print("Accuracy with features selected by f-score:", selected_accuracy)

Features selected by f-score: ['worst concave points' 'worst perimeter' 'mean concave points'
 'worst radius' 'mean perimeter' 'worst area' 'mean radius' 'mean area'
 'mean concavity' 'worst concavity']
Accuracy with features selected by f-score: 0.9912280701754386
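
As a sanity check, the per-feature F-values from `scipy.stats.f_oneway` can be compared with those from sklearn's `f_classif`. This is a sketch that assumes both are computed on the same full dataset; the two arrays should agree up to floating-point error.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True)

# sklearn: F-values for all features at once
f_sklearn, _ = f_classif(X, y)

# scipy: one F-test per feature, comparing the two classes
f_scipy = np.array([
    f_oneway(X[y == 0, j], X[y == 1, j])[0]
    for j in range(X.shape[1])
])

same_scores = np.allclose(f_sklearn, f_scipy, rtol=1e-6)
print("Scores match:", same_scores)
```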


# Ensemble

Combining multiple models (weak learners) produces a stronger model (a strong learner).

In [21]:
import numpy as np
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data  
y = data.target  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
model = DecisionTreeClassifier(max_depth=1, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy: ", accuracy)

Test accuracy:  0.8947368421052632


## Bagging Ensemble

Train multiple independent models on bootstrap samples of the data, then average their predictions or take a majority vote to get the final result.

In [25]:
X.shape, y.shape

((569, 30), (569,))

In [29]:
# Train n_estimators KNN models, each on its own bootstrap sample
n_estimators = 100
bootstrap_samples = 50

weak_models = []

for _ in range(n_estimators):
    # Sample row indices with replacement
    bootstrap_index = np.random.choice(X_train.shape[0], bootstrap_samples, replace=True)
    X_bootstrap = X_train[bootstrap_index]
    y_bootstrap = y_train[bootstrap_index]
    
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_bootstrap, y_bootstrap)
    weak_models.append(knn)

# One column of test predictions per model
predictions = np.zeros((y_test.shape[0], n_estimators))
for i, knn in enumerate(weak_models):
    predictions[:, i] = knn.predict(X_test)

# Majority vote for 0/1 labels: round the mean prediction
ensemble_predictions = np.round(np.mean(predictions, axis=1))

accuracy = accuracy_score(ensemble_predictions, y_test)
print(f"Bagging (KNN) accuracy: {accuracy:.2f}")

Bagging (KNN) accuracy: 0.95
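
scikit-learn wraps the same idea in `BaggingClassifier`. This is a sketch with settings chosen to mirror the manual loop (100 KNN models, 50 bootstrap samples each); `random_state=42` is an added assumption for reproducibility. Its aggregation may differ slightly from rounding the mean of hard votes, because it averages predicted probabilities when the base model provides them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The base model is passed positionally; max_samples=50 draws
# 50 training rows (with replacement) for each KNN model
bagging = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=3),
    n_estimators=100,
    max_samples=50,
    bootstrap=True,
    random_state=42,
)
bagging.fit(X_train, y_train)

bagging_accuracy = accuracy_score(y_test, bagging.predict(X_test))
print(f"BaggingClassifier (KNN) accuracy: {bagging_accuracy:.2f}")
```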


## Boosting Ensemble

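Train models sequentially, with each model focusing on the samples that the previous model misclassified. The procedure described here is AdaBoost, and its formulas assume labels encoded as $y_i \in \{-1, +1\}$.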


We start by assigning **equal weights** to all training samples.

$$
w_i = \frac{1}{N}
$$

where $N$ is the total number of training samples.

At each iteration $t$:

1. Train a **Decision Stump** (a one-level Decision Tree) using the current sample weights.
2. Let the model be $h_t(x)$.
3. Compute its weighted error:

$$
\text{error}_t = \frac{\sum_i w_i [y_i \neq h_t(x_i)]}{\sum_i w_i}
$$

This is the weighted fraction of samples that the model misclassifies.


The weight of the weak learner is:

$$
\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \text{error}_t}{\text{error}_t} \right)
$$

- If the model performs well → low error → large $\alpha_t$
- If the model performs worse than random guessing (error > 0.5) → negative $\alpha_t$


After each iteration, increase the weight of the **misclassified** samples so the next learner focuses on them.

$$
w_i \leftarrow w_i \cdot e^{-\alpha_t y_i h_t(x_i)}
$$

Then normalize:

$$
w_i \leftarrow \frac{w_i}{\sum_j w_j}
$$

- If $y_i = h_t(x_i)$ → correctly classified → weight decreases  
- If $y_i \neq h_t(x_i)$ → misclassified → weight increases


The final strong classifier is a weighted majority vote:

$$
H(x) = \text{sign} \left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)
$$

- Each model votes with strength $\alpha_t$, which grows as its weighted error falls
- The sign of the weighted sum gives the final class label



In [30]:
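n_estimators = 50

estimators = []

# Start with equal weights for all training samples
weights = np.ones(len(X_train)) / len(X_train)

for _ in range(n_estimators):
    # Decision stump trained with the current sample weights
    tree = DecisionTreeClassifier(max_depth=1)
    tree.fit(X_train, y_train, sample_weight=weights)

    y_pred = tree.predict(X_train)

    # Weighted error and the resulting weight of this stump
    error = np.sum(weights * (y_pred != y_train)) / np.sum(weights)
    tree_weight = 0.5 * np.log((1-error) / error)
    # NOTE: this update and the np.sign vote below assume labels in {-1, +1},
    # but load_breast_cancer returns labels in {0, 1}, so y_train * y_pred
    # is 1 only when both are 1 and 0 otherwise.
    weights = weights * np.exp(-tree_weight * y_train * y_pred)

    weights /= np.sum(weights)

    estimators.append((tree, tree_weight))

# Weighted vote of all stumps on the test set
y_pred_ensemble = np.zeros_like(y_test, dtype=float)
for tree, tree_weight in estimators:
    y_pred_tree = tree.predict(X_test)
    y_pred_ensemble += tree_weight * y_pred_tree

y_pred_ensemble = np.sign(y_pred_ensemble)

accuracy = accuracy_score(y_test, y_pred_ensemble)
print("Accuracy of Boosting (decision stumps)", accuracy)


Accuracy of Boosting (decision stumps) 0.9210526315789473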
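
The weight update and the `sign` vote assume labels in $\{-1, +1\}$, while `load_breast_cancer` returns labels in $\{0, 1\}$. This is a sketch of the same loop with the labels remapped first; the remapping, the `np.clip` guard against a zero error, and `random_state=42` are assumptions added here, and the resulting accuracy will generally differ from the number printed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y_signed = np.where(y == 1, 1, -1)  # map {0, 1} to {-1, +1}

X_train, X_test, y_train, y_test = train_test_split(
    X, y_signed, test_size=0.2, random_state=42)

n_estimators = 50
weights = np.full(len(X_train), 1 / len(X_train))
estimators = []

for _ in range(n_estimators):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_train, y_train, sample_weight=weights)
    y_pred = stump.predict(X_train)

    # Weighted error, clipped so the logarithm stays finite
    error = np.sum(weights * (y_pred != y_train)) / np.sum(weights)
    error = np.clip(error, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - error) / error)

    # y * h(x) is +1 for correct samples and -1 for wrong ones
    weights = weights * np.exp(-alpha * y_train * y_pred)
    weights /= weights.sum()

    estimators.append((stump, alpha))

# Weighted vote; ties at exactly zero are assigned to +1
scores = sum(alpha * stump.predict(X_test) for stump, alpha in estimators)
y_pred_ensemble = np.where(scores >= 0, 1, -1)

boosting_accuracy = accuracy_score(y_test, y_pred_ensemble)
print("Accuracy of Boosting (decision stumps):", boosting_accuracy)
```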


## Stacking Ensemble

Stacking trains several base models, then trains a meta-learner that takes their predictions as input features and produces the final prediction.

In [32]:
# Base models
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
knn_model = KNeighborsClassifier(n_neighbors=3)
lr_model = LogisticRegression()

rf_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
knn_pred = knn_model.predict(X_test)
lr_pred = lr_model.predict(X_test)

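# Use the base models' test-set predictions as features for the meta-learner
stacked_predictions = np.column_stack((rf_pred, knn_pred, lr_pred))

# Meta learner
# NOTE: the meta-learner is fitted on y_test and then evaluated on the same
# samples, so the accuracy below is optimistic. A held-out or cross-validated
# set of base-model predictions is the usual way to train it.
meta_learner = LogisticRegression()
meta_learner.fit(stacked_predictions, y_test)
final_predictions = meta_learner.predict(stacked_predictions)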


accuracy = accuracy_score(y_test, final_predictions)
print(f"Accuracy of Stacking Ensemble: {accuracy:.2f}")

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy of Stacking Ensemble: 0.96
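
In the cell above, the meta-learner is fitted on the test labels and then scored on the same test samples. One way to avoid that is sklearn's `StackingClassifier`, which trains the meta-learner on out-of-fold predictions from the training set. This is a sketch with the same three base models; `cv=5` and `max_iter=10000` are assumptions chosen here, not settings from the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("lr", LogisticRegression(max_iter=10000)),
]

# The meta-learner sees only cross-validated predictions on X_train
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=10000),
    cv=5,
)
stack.fit(X_train, y_train)

# The test set is used only for evaluation
stacking_accuracy = accuracy_score(y_test, stack.predict(X_test))
print(f"Accuracy of StackingClassifier: {stacking_accuracy:.2f}")
```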
