
# **1. Chi-Square (χ²) Feature Selection Method**

### 🌟 What is Chi-Square Feature Selection?

The **Chi-Square (χ²) test** is a **statistical method** used in **feature selection** to find out whether two variables (usually **feature and target**) are **independent**.

In feature selection, we use the Chi-Square test to **rank categorical input features** based on how strongly they are related to the **categorical output (class label)**.

---

### 🔍 When to Use:

* Input: **Categorical features**
* Output (target): **Categorical class label**

> ⚠️ If you have **numeric data**, you need to **convert it to categorical** (e.g., using binning).

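A minimal sketch of binning, assuming NumPy is available. The ages, the cut points (20 and 60), and the group names are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical numeric feature: ages of six people
ages = np.array([15, 22, 37, 45, 63, 70])

# Hypothetical cut points: below 20 -> Teen, 20 to 59 -> Adult, 60 and up -> Senior
bin_edges = [20, 60]
labels = np.array(["Teen", "Adult", "Senior"])

# np.digitize returns, for each age, the index of the bin it falls into
bin_index = np.digitize(ages, bin_edges)
age_group = labels[bin_index]

print(bin_index)  # [0 1 1 1 2 2]
print(age_group)  # ['Teen' 'Adult' 'Adult' 'Adult' 'Senior' 'Senior']
```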
---

### 🧠 Key Idea:

Chi-Square score =

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

Where:

* $O$: Observed frequency (from the dataset)
* $E$: Expected frequency (if there was no relationship)

If $\chi^2$ is **high**, the feature and label are likely **dependent** → good feature.

If $\chi^2$ is **low**, there is little evidence of dependence, so the feature and label look **independent** → bad feature.

---

### 📊 Example

Let’s say we are trying to predict whether a person **buys a phone** (Yes/No) based on **Age Group**.

| Age Group | Buys Phone: Yes | Buys Phone: No |
| --------- | --------------- | -------------- |
| Teen      | 10              | 30             |
| Adult     | 30              | 10             |
| Senior    | 20              | 20             |

---

### 🧮 Step-by-step Calculation

#### Step 1: Total values

* Total people: 10+30+20+30+10+20 = 120
* Total Yes: 10+30+20 = 60
* Total No: 30+10+20 = 60

---

#### Step 2: Expected Frequencies (E)

Expected = (Row Total × Column Total) / Grand Total

Example: For "Teen-Yes" =

$$
\text{Expected} = \frac{\text{Teen Total} \times \text{Yes Total}}{\text{Grand Total}} = \frac{40 \times 60}{120} = 20
$$

Compute for all cells:

| Age Group | Yes (O, E) | No (O, E) |
| --------- | ---------- | --------- |
| Teen      | 10, 20     | 30, 20    |
| Adult     | 30, 20     | 10, 20    |
| Senior    | 20, 20     | 20, 20    |

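The expected counts for every cell can be reproduced with an outer product of the row and column totals. This is a small sketch assuming NumPy; the variable names are illustrative.

```python
import numpy as np

# Observed counts: rows = Teen, Adult, Senior; columns = Yes, No
observed = np.array([[10, 30],
                     [30, 10],
                     [20, 20]])

row_totals = observed.sum(axis=1)  # one total per age group
col_totals = observed.sum(axis=0)  # one total per outcome
grand_total = observed.sum()

# Expected = (row total x column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total

print(expected)  # every cell is 20.0
```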
---

#### Step 3: Compute Chi-Square Score

Use formula:

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

$$
\chi^2 = \frac{(10 - 20)^2}{20} + \frac{(30 - 20)^2}{20} + \frac{(30 - 20)^2}{20} + \frac{(10 - 20)^2}{20} + \frac{(20 - 20)^2}{20} + \frac{(20 - 20)^2}{20}
= 5 + 5 + 5 + 5 + 0 + 0 = 20
$$

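So, **χ² = 20**. With (3 − 1) × (2 − 1) = 2 degrees of freedom, this is well above the 5% critical value of 5.99, so it is a high score → strong relationship.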

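As a cross-check of the hand calculation, here is a sketch that sums the formula over all six cells and compares the result with `scipy.stats.chi2_contingency` (assuming SciPy is installed). SciPy also reports the degrees of freedom and a p-value, which is a more direct way to judge whether a score counts as "high".

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Teen, Adult, Senior; columns = Yes, No
observed = np.array([[10, 30],
                     [30, 10],
                     [20, 20]])

# Manual calculation: sum of (O - E)^2 / E over every cell
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Library calculation: returns statistic, p-value, degrees of freedom, expected table
chi2_stat, p_value, dof, _ = chi2_contingency(observed)

print(chi2_manual, chi2_stat)  # both 20.0
print(dof)                     # 2
print(p_value)                 # far below 0.05
```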
---


### 💡 Summary

| Aspect                | Chi-Square Test                        |
| --------------------- | -------------------------------------- |
| Data type required    | Categorical                            |
| Purpose               | Feature selection                      |
| Measures              | Dependency between feature and label   |
| High Chi-Square value | Strong relationship → keep the feature |
| Low Chi-Square value  | Weak relationship → drop the feature   |




In [1]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Sample dataset (X must be non-negative and categorical or encoded as integers)
X = [[1, 0, 2],
     [2, 1, 0],
     [3, 1, 1],
     [1, 0, 2]]

y = [0, 1, 1, 0]  # Class label

# Apply Chi-Square
chi2_selector = SelectKBest(score_func=chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)

print("Selected Features:\n", X_kbest)
print("Scores:\n", chi2_selector.scores_)

Selected Features:
 [[0 2]
 [1 0]
 [1 1]
 [0 2]]
Scores:
 [1.28571429 2.         1.8       ]

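A short follow-up sketch on the same toy data: `chi2` can also be called directly to get a p-value per feature, and `get_support` reports which column indices `SelectKBest` kept.

```python
from sklearn.feature_selection import SelectKBest, chi2

# Same toy data as the cell above
X = [[1, 0, 2],
     [2, 1, 0],
     [3, 1, 1],
     [1, 0, 2]]
y = [0, 1, 1, 0]

# chi2 returns one score and one p-value per feature column
scores, p_values = chi2(X, y)

# Fit the selector and ask which column indices were kept
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
kept = selector.get_support(indices=True)

print("Scores:", scores)
print("p-values:", p_values)
print("Kept column indices:", kept)  # [1 2]
```

With only four rows, none of these p-values is small, so the ranking here is illustrative only.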

# **2. Mutual Information (MI)**



## 🧠 What is Mutual Information?

Mutual Information (MI) tells us **how much information one variable gives us about another**.

In feature selection, we use MI to find out:

> ❓ “Does this feature tell me anything useful about the class label?”

* If yes → it’s a good feature
* If no → it’s a useless feature

---

### 🌟 Think of it Like This:

* If knowing feature **X** tells you a lot about the class **Y**, the **mutual information is high**.
* If knowing feature **X** gives you no clue about **Y**, the **mutual information is low (or zero)**.

---

### 🧮 Formula (Just for awareness):

$$
MI(X, Y) = \sum_{x,y} P(x,y) \cdot \log\left(\frac{P(x,y)}{P(x)P(y)}\right)
$$

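Here $P(x,y)$ is the joint probability of a feature value $x$ and a class $y$, and $P(x)$, $P(y)$ are their individual probabilities. In practice, **you don’t have to calculate this manually**; libraries like `sklearn` do it for you.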

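For readers who want to see the formula in action, here is a minimal NumPy sketch. The function name `mutual_information` and the two small count tables are hypothetical, chosen only for illustration; the result is in nats because the natural logarithm is used.

```python
import numpy as np

def mutual_information(joint_counts):
    """Mutual information (in nats) from a table of joint counts.

    Rows are values of x, columns are values of y.
    """
    joint = np.asarray(joint_counts, dtype=float)
    p_xy = joint / joint.sum()             # joint probabilities P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal P(x), shape (n, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal P(y), shape (1, m)
    product = p_x @ p_y                    # P(x) * P(y) for every cell
    nonzero = p_xy > 0                     # cells with P(x, y) = 0 contribute 0
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / product[nonzero])))

# x fully determines y: MI equals ln(2), about 0.693
print(mutual_information([[2, 0], [0, 2]]))

# x and y are independent: MI is 0
print(mutual_information([[1, 1], [1, 1]]))
```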
---

## 🔍 Simple Example:

Let’s say we’re predicting whether a student **passed** or **failed**, and one of the features is whether they **studied or not**.

### Dataset:

| Studied | Result |
| ------- | ------ |
| Yes     | Pass   |
| No      | Fail   |
| Yes     | Pass   |
| No      | Fail   |
| Yes     | Pass   |

In this table, everyone who studied passed and everyone who did not study failed, so the feature fully determines the result → **Strong relationship**

Mutual Information = **High**

Now imagine:

| Studied | Result |
| ------- | ------ |
| Yes     | Pass   |
| Yes     | Fail   |
| No      | Pass   |
| No      | Fail   |

Here, students pass and fail equally often whether or not they studied, so knowing the feature tells us nothing about the result.

Mutual Information = **Low (0 for this table)**

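The two tables above can be checked with `sklearn.metrics.mutual_info_score`, which works on discrete labels and returns the result in nats. In this sketch the 0/1 encoding (1 = Yes/Pass, 0 = No/Fail) is an assumption made for convenience.

```python
from sklearn.metrics import mutual_info_score

# First table: studying always goes with passing (1 = Yes/Pass, 0 = No/Fail)
studied_a = [1, 0, 1, 0, 1]
result_a = [1, 0, 1, 0, 1]

# Second table: pass and fail are equally likely with or without studying
studied_b = [1, 1, 0, 0]
result_b = [1, 0, 1, 0]

mi_a = mutual_info_score(studied_a, result_a)
mi_b = mutual_info_score(studied_b, result_b)

print(mi_a)  # about 0.673, the full entropy of the result column
print(mi_b)  # 0.0
```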
---


### 📝 Summary Table:

| Feature Selection Method | Works with             | Handles Non-Linear | Needs Normal Data? | Good for Imbalanced Data? |
| ------------------------ | ---------------------- | ------------------ | ------------------ | ------------------------- |
| **Mutual Information**   | Categorical or numeric | ✅ Yes              | ❌ No               | ✅ Yes (better than chi²)  |
| Chi-Square               | Categorical only       | ✅ Yes (any association between categories) | ❌ No               | ❌ Unreliable when expected counts are small |

---



In [2]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest

# Sample dataset
X = [[1, 0, 2],
     [2, 1, 0],
     [3, 1, 1],
     [1, 0, 2]]
y = [0, 1, 1, 0]

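# Apply Mutual Information
# Note: by default a dense X is treated as continuous, and the estimator adds
# small random noise, so scores can vary between runs unless random_state is set.
mi = mutual_info_classif(X, y)
print("Mutual Info Scores:", mi)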

# Select top 2 features
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_new = selector.fit_transform(X, y)

Mutual Info Scores: [0.20833333 0.83333333 0.45833333]

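The scores above come from the default setting, which treats a dense `X` as continuous. Since the toy features are integer codes, a variant worth trying is `discrete_features=True`; the sketch below also sets `random_state` for repeatability (it has no effect when all features are discrete, but is a good habit).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Same toy data as the cell above
X = [[1, 0, 2],
     [2, 1, 0],
     [3, 1, 1],
     [1, 0, 2]]
y = [0, 1, 1, 0]

# Treat every column as a discrete (categorical) feature
mi_discrete = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print(mi_discrete)  # each score equals ln(2), about 0.693
print(np.log(2))
```

In this tiny dataset each column on its own determines the label exactly, so all three features tie at ln(2). On four rows, neither estimate should be read as a reliable ranking.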

# **Most Important Feature Extraction/Selection Methods**

## **1. For Tabular Data (Structured Data)**

| Method                                 | Type                                     | Works On                        | Use When...                                                                                  |
| -------------------------------------- | ---------------------------------------- | ------------------------------- | -------------------------------------------------------------------------------------------- |
| **Chi-Square Test**                    | Filter Method                            | Categorical                     | You want to test dependence between feature and class (but watch out for imbalance).         |
| **Mutual Information**                 | Filter Method                            | Categorical/Discrete/Continuous | You want to capture both linear and non-linear dependencies.                                 |
| **ANOVA F-Test**                       | Filter Method                            | Numerical + Categorical Label   | You want features whose mean values differ significantly between the classes.                |
| **Correlation Matrix**                 | Filter Method                            | Numerical                       | To remove redundant features (features that are highly correlated with each other).          |
| **L1 Regularization (Lasso)**          | Embedded Method                          | Numerical                       | You want training itself to shrink the coefficients of unimportant features to zero.         |
| **Tree-based Feature Importance**      | Embedded Method                          | Any (auto-handles types)        | You want fast feature scoring with decision tree-based models (like Random Forest, XGBoost). |
| **PCA (Principal Component Analysis)** | Dimensionality Reduction                 | Numerical                       | You want to transform features to a smaller set of *uncorrelated* ones.                      |
| **Autoencoders**                       | Dimensionality Reduction (Deep Learning) | Numerical                       | You want to learn new compressed feature representations automatically.                      |

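To make the table more concrete, here is a compact sketch of one method from each family (filter, embedded, dimensionality reduction) using scikit-learn. The choice of the built-in Iris dataset, `k=2`, `n_components=2`, and `n_estimators=100` are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Iris: 150 rows, 4 numerical features, 3 classes
X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest ANOVA F-score
X_anova = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Embedded method: feature importances learned while training a random forest
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Dimensionality reduction: project onto 2 uncorrelated principal components
X_pca = PCA(n_components=2).fit_transform(X)

print(X_anova.shape)  # (150, 2)
print(X_pca.shape)    # (150, 2)
print(importances)    # four values that sum to 1
```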
## **2. For Text Data (NLP)**

| Method               | Type                 | Description                                                                |
| -------------------- | -------------------- | -------------------------------------------------------------------------- |
| **TF-IDF**           | Extraction           | Calculates how important a word is in a document relative to a collection. |
| **Word2Vec / GloVe** | Embedding            | Converts words into dense vectors capturing semantic meaning.              |
| **BERT Embeddings**  | Contextual Embedding | Converts full sentences/words to vectors using transformer-based models.   |

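As a small illustration of the first row, here is a TF-IDF sketch with scikit-learn's `TfidfVectorizer`. The three example sentences are made up for this demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of three short documents
docs = [
    "the phone is good",
    "the phone is bad",
    "good camera good battery",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)   # ['bad', 'battery', 'camera', 'good', 'is', 'phone', 'the']
print(tfidf.shape)  # (3, 7)
```

Each row is one document and each column is one word; words that appear in fewer documents receive a higher weight than words that appear in many documents.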

## **3. For Image Data**

| Method                                    | Type                     | Description                                             |
| ----------------------------------------- | ------------------------ | ------------------------------------------------------- |
| **HOG (Histogram of Oriented Gradients)** | Traditional Extraction   | Good for object detection in classical computer vision. |
| **SIFT, SURF**                            | Traditional Extraction   | Keypoint detection, image matching.                     |
| **CNN Feature Maps**                      | Deep Learning Extraction | Extract features from intermediate CNN layers.          |


# **Which Should You Use?**

| Your Case                                        | Best Feature Methods                             |
| ------------------------------------------------ | ------------------------------------------------ |
| Tabular classification (categorical + numerical) | Mutual Info, Chi-Square, Lasso, Random Forest    |
| Imbalanced data                                  | SMOTE + Mutual Info or Tree-Based Importance     |
| Too many numerical features                      | PCA or Lasso                                     |
| Text classification                              | TF-IDF or BERT embeddings                        |
| Images                                           | CNN feature maps (e.g., using pretrained ResNet) |

