In [None]:
# ! pip install ucimlrepo

## Dataset

We will use the **Adult Income Dataset** from UCI.

- **Dataset Link:** [Adult Dataset (UCI Machine Learning Repository)](https://archive.ics.uci.edu/ml/datasets/adult)
- **Task:** Binary Classification — predict whether a person earns **≤ 50K** or **> 50K** per year.
- **Features:**  
  Mix of **categorical** and **numeric** features:
  - *Categorical:* `workclass`, `education`, `occupation`, `marital-status`, etc.
  - *Numeric:* `age`, `hours-per-week`, `capital-gain`, `capital-loss`, etc.
- **Target Variable:**  
  `income` — indicates whether the person earns **≤ 50K** or **> 50K**.



In [None]:
from ucimlrepo import fetch_ucirepo

In [3]:
adult = fetch_ucirepo(id=2)
X = adult.data.features # features (pandas DataFrame)
y = adult.data.targets # target (pandas DataFrame)

In [4]:
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables) 

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

## 🧹 1. Data Preparation

- Handle missing values (drop or impute).  
- Encode categorical variables into numeric values (e.g., Label Encoding).  
- Split the dataset as follows:
  - **60% Training**
  - **20% Validation**
  - **20% Test**

> Use the validation set to tune tree depth and pruning parameters.
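
The preparation steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed solution: the helper names `impute_mode`, `label_encode`, and `split_indices` are hypothetical, missing entries are assumed to appear as `None` or a `"?"` placeholder, and the validation and test fractions are parameters (20% each by default here, with the remainder used for training).

```python
import random
from collections import Counter

def impute_mode(values, missing=(None, "?")):
    """Replace missing entries with the most frequent observed value."""
    present = [v for v in values if v not in missing]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v in missing else v for v in values]

def label_encode(values):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def split_indices(n_rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row positions and cut them into train/validation/test lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_test = int(round(n_rows * test_frac))
    n_val = int(round(n_rows * val_frac))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test
```

With the DataFrame `X` loaded earlier, the returned index lists could be used as `X.iloc[train_idx]`, `X.iloc[val_idx]`, and `X.iloc[test_idx]`. Fit the encoding and imputation on the training rows only, so no information from the validation or test rows leaks into training.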


## 🌳 2. Build a Decision Tree from Scratch

Implement the tree recursively:

1. **At each split:**
   - Compute both **Gini Impurity** and **Entropy**.
   - For each feature and possible split, calculate the **weighted impurity** of child nodes.
   - Choose the split with the **highest information gain**, i.e. the largest drop from the parent's impurity to the weighted impurity of its children.

2. **Continue splitting until:**
   - All samples in a node have the same label, **OR**
   - The **maximum depth** is reached, **OR**
   - There is **no further improvement** in impurity.

3. **Implement a function** to predict labels for new samples.
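
The split-selection step can be illustrated with a short sketch. It assumes features are already numeric (after encoding), that rows are plain lists, and that splits are binary thresholds of the form `x[f] <= t`. The function names are hypothetical and the exhaustive search is written for clarity rather than speed.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, impurity=gini):
    """Return (gain, feature_index, threshold) of the best split, or None."""
    parent = impurity(labels)
    n = len(labels)
    best = None
    for f in range(len(rows[0])):
        # Skip the largest value: it would leave the right child empty.
        for t in sorted(set(r[f] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * impurity(left)
                        + len(right) * impurity(right)) / n
            gain = parent - weighted  # impurity decrease of this split
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best
```

Passing `impurity=entropy` switches the criterion without touching the search loop, which makes the Gini vs. Entropy comparison a one-argument change.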


## ✂️ 3. Pre-Pruning (Restricting Tree Growth)

While building the tree, apply pre-pruning techniques:

- Limit **maximum depth** (try depths = 2, 4, 6, and unlimited).  
- Require at least a **minimum number of samples** (e.g., 5) to split.  
- Optionally, require a **minimum impurity decrease** to split further.
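
One possible way to wire these limits into the recursive builder is shown below. It is a sketch under several assumptions: nodes are plain dictionaries (a hypothetical layout), only Gini impurity is used, features are numeric, and a split is accepted only when its impurity decrease is strictly greater than `min_impurity_decrease`.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=None,
               min_samples_split=2, min_impurity_decrease=0.0):
    """Grow a tree recursively, stopping early when a pre-pruning rule fires."""
    majority = Counter(labels).most_common(1)[0][0]
    leaf = {"leaf": True, "label": majority}
    if len(set(labels)) == 1:                          # node is pure
        return leaf
    if max_depth is not None and depth >= max_depth:   # depth limit reached
        return leaf
    if len(labels) < min_samples_split:                # too few samples
        return leaf

    parent, n, best = gini(labels), len(labels), None
    for f in range(len(rows[0])):
        for t in sorted(set(r[f] for r in rows))[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            weighted = (len(left) * gini([labels[i] for i in left])
                        + len(right) * gini([labels[i] for i in right])) / n
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t, left, right)

    # Stop if no split exists or the improvement is too small.
    if best is None or best[0] <= min_impurity_decrease:
        return leaf

    _, f, t, left, right = best
    params = dict(max_depth=max_depth, min_samples_split=min_samples_split,
                  min_impurity_decrease=min_impurity_decrease)
    return {
        "leaf": False, "feature": f, "threshold": t,
        "label": majority,  # kept on internal nodes for later pruning
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], depth + 1, **params),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], depth + 1, **params),
    }

def predict_one(tree, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while not tree["leaf"]:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["label"]
```

Leaving `max_depth=None` corresponds to the "unlimited" setting. Storing the majority label on internal nodes costs little and lets a later pruning pass turn any node into a leaf without revisiting the training data.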


## 🪚 4. Post-Pruning (Reduced Error Pruning)

Steps for reduced error pruning:

1. First, **grow a full tree**.  
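2. For each internal node:
   - Temporarily replace it with a **leaf node** that predicts the node's majority class.
   - Evaluate **validation accuracy** of the resulting tree.
3. If accuracy **does not decrease**, keep the pruning; otherwise, restore the subtree.  
4. Repeat until no further node can be pruned without lowering validation accuracy.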
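
The pruning loop can be sketched as follows. It assumes a hypothetical dictionary-based node layout in which every node stores a `"leaf"` flag and a majority `"label"`, and internal nodes also store `"feature"`, `"threshold"`, `"left"`, and `"right"`. Nodes are visited bottom-up, and collapsing a node is done by flipping its `"leaf"` flag so that it is easy to undo.

```python
def predict_one(tree, row):
    while not tree["leaf"]:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["label"]

def accuracy(tree, rows, labels):
    hits = sum(predict_one(tree, r) == y for r, y in zip(rows, labels))
    return hits / len(labels)

def internal_nodes(node):
    """Yield internal nodes bottom-up (children before their parent)."""
    if node["leaf"]:
        return
    yield from internal_nodes(node["left"])
    yield from internal_nodes(node["right"])
    yield node

def reduced_error_prune(root, val_rows, val_labels):
    """Collapse nodes while validation accuracy does not decrease."""
    changed = True
    while changed:
        changed = False
        base = accuracy(root, val_rows, val_labels)
        for node in list(internal_nodes(root)):
            node["leaf"] = True                 # tentatively prune
            pruned = accuracy(root, val_rows, val_labels)
            if pruned >= base:
                base, changed = pruned, True    # keep the pruning
            else:
                node["leaf"] = False            # restore the subtree
    return root

# Tiny hand-built tree whose right subtree contains one noisy leaf.
tree = {
    "leaf": False, "feature": 0, "threshold": 2, "label": 0,
    "left": {"leaf": True, "label": 0},
    "right": {
        "leaf": False, "feature": 0, "threshold": 3, "label": 1,
        "left": {"leaf": True, "label": 0},
        "right": {"leaf": True, "label": 1},
    },
}
val_rows, val_labels = [[1], [3], [4]], [0, 1, 1]
pruned_tree = reduced_error_prune(tree, val_rows, val_labels)
```

In this toy example the right subtree is collapsed into a single leaf because doing so raises validation accuracy, while the root split is kept because removing it would lower it.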


## 🧾 5. Evaluation

- Train using the **training set**.
- Tune depth and pruning using the **validation set**.
- Report final results on the **test set**.

### Metrics to Report
- **Accuracy**
- **Precision**
- **Recall**
- **F1-score**
- **Confusion Matrix**

> Compare your implementation with `sklearn.tree.DecisionTreeClassifier`.
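
All of the listed metrics follow from the four confusion-matrix counts. The sketch below computes them directly. The function name `binary_metrics` is hypothetical, the positive class is assumed to be encoded as `1` (for example `> 50K`), and undefined ratios are reported as `0.0`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1 and the 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predictions: [[TN, FP], [FN, TP]]
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }
```

Applying the same function to the predictions of the from-scratch tree and to those of `sklearn.tree.DecisionTreeClassifier` on the same test split keeps the comparison consistent.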


## 🔬 6. Experiments to Perform

Perform and report the following experiments:

1. **Compare Gini vs. Entropy.**  
2. **Compare different depths** (2, 4, 6, unlimited).  
3. **Show the effect of pruning** (pre-pruned vs. post-pruned vs. full tree).  
4. **Identify the most important features** — the ones used near the top of the tree.
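
For the last experiment, one simple heuristic is to record the shallowest depth at which each feature is used. The sketch below assumes the same hypothetical dictionary-based node layout (`"leaf"`, `"feature"`, `"left"`, `"right"`) and a list of column names. It ranks features by depth only, so it is not the impurity-based importance that scikit-learn reports.

```python
from collections import deque

def features_by_depth(tree, feature_names):
    """Return (feature_name, shallowest_depth) pairs, shallowest first."""
    first_seen = {}
    queue = deque([(tree, 0)])          # breadth-first walk from the root
    while queue:
        node, depth = queue.popleft()
        if node["leaf"]:
            continue
        name = feature_names[node["feature"]]
        first_seen.setdefault(name, depth)  # keep only the first (shallowest) use
        queue.append((node["left"], depth + 1))
        queue.append((node["right"], depth + 1))
    return sorted(first_seen.items(), key=lambda item: item[1])
```

Features that appear at depth 0 or 1 separate the largest share of the training data, which is why they are a reasonable first answer to "most important".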
