# Decision Tree Classifier from Scratch
A from-scratch implementation of a decision tree classifier that predicts loan approval.

## 1. Data Loading & Preprocessing
Load the loan approval dataset and take a first look at its contents.

In [None]:
import numpy as np 
...
loan_df = pd.read_csv('loan.csv')

### Dataset Overview
The dataset contains:
- **Features**:
  - Demographic information (age, gender, marital status)
  - Financial information (income, credit score)
  - Education level and occupation
- **Target**: Loan status (Approved/Denied)

**Initial Preprocessing**:
- Mapping categorical variables to numerical values
- Dropping high-cardinality features (occupation)
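
The two preprocessing steps can be sketched on a toy frame as follows. The column names (`gender`, `education`, `occupation`, `loan_status`), the category values, and the integer codes are all assumptions made for illustration; the actual schema of `loan.csv` and the mappings used in this notebook may differ.

```python
import pandas as pd

# Toy stand-in for loan_df; columns and values are assumed, not read from loan.csv.
toy_df = pd.DataFrame({
    "age": [32, 45, 28],
    "gender": ["Male", "Female", "Female"],
    "education": ["Bachelor's", "Master's", "High School"],
    "occupation": ["Engineer", "Teacher", "Clerk"],
    "loan_status": ["Approved", "Approved", "Denied"],
})

# Hypothetical integer encodings for the categorical columns.
gender_map = {"Male": 0, "Female": 1}
education_map = {"High School": 0, "Bachelor's": 1, "Master's": 2}
status_map = {"Denied": 0, "Approved": 1}

encoded = toy_df.assign(
    gender=toy_df["gender"].map(gender_map),
    education=toy_df["education"].map(education_map),
    loan_status=toy_df["loan_status"].map(status_map),
).drop(columns=["occupation"])  # drop the high-cardinality feature

# .map turns any category missing from the dict into NaN, so check for that.
has_unmapped = bool(encoded.isna().any().any())
```

Checking `has_unmapped` after mapping is a cheap guard: a misspelled or unexpected category would otherwise propagate silently as `NaN`.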

## 2. Core Algorithm Components

### 2.1 Entropy Calculation
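Node impurity is measured with Shannon entropy:
$$ H(S) = -\sum_{i=1}^c p_i \log_2 p_i $$
where $p_i$ is the proportion of class $i$ in set $S$ and $c$ is the number of classes. A pure node has $H(S) = 0$; for two classes the maximum is 1 bit, reached at a 50/50 split.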

In [None]:
def entropy(p):
    ...
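
Since the body of `entropy` is elided above, here is a minimal sketch of the formula for reference. The helper name `shannon_entropy` is hypothetical, and it takes an array of class labels, which may not match the signature of the notebook's `entropy(p)`.

```python
import numpy as np

def shannon_entropy(labels):
    """Entropy in bits of a 1-D collection of class labels (hypothetical helper)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # np.unique only reports classes that occur, so every p_i > 0 and the log is finite.
    # p * log2(1/p) is the same quantity as -p * log2(p).
    return float(np.sum(p * np.log2(1.0 / p)))

balanced = shannon_entropy([0, 1, 0, 1])  # two equally likely classes
pure = shannon_entropy([1, 1, 1, 1])      # a single class
```

The two calls illustrate the extremes for a binary target: the balanced node is maximally impure and the single-class node has no impurity.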

### 2.2 Information Gain
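Information gain is the reduction in entropy obtained by splitting $S$ on feature $A$:
$$ IG(S,A) = H(S) - \sum_{t\in T} \frac{|S_t|}{|S|} H(S_t) $$
Where:
- $H(S)$ = entropy of parent node
- $T$ = set of branches produced by splitting on $A$
- $S_t$ = subset of data for split $t$
- $|S_t|/|S|$ = fraction of the parent's samples that fall into $S_t$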

In [None]:
def information_gain(X, y, indices, feature):
    ...
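
As a worked illustration of the formula, the sketch below computes the gain for two candidate splits of a four-sample node. It is not the notebook's `information_gain(X, y, indices, feature)`; the helpers `entropy_bits` and `info_gain` are hypothetical and operate on a label array plus a parallel array of group ids.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy in bits of a label array (hypothetical helper)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def info_gain(labels, groups):
    """Gain from partitioning `labels` by `groups` (one group id per sample)."""
    labels = np.asarray(labels)
    groups = np.asarray(groups)
    weighted_child_entropy = 0.0
    for g in np.unique(groups):
        subset = labels[groups == g]
        # Weight each child's entropy by its share of the parent's samples.
        weighted_child_entropy += subset.size / labels.size * entropy_bits(subset)
    return entropy_bits(labels) - weighted_child_entropy

y_toy = np.array([1, 1, 0, 0])
perfect = info_gain(y_toy, [0, 0, 1, 1])  # both children are pure
useless = info_gain(y_toy, [0, 1, 0, 1])  # both children stay 50/50
```

The first split removes all of the parent's 1 bit of entropy, while the second leaves each child as mixed as the parent and gains nothing.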

### 2.3 Splitting Criteria
The splitting logic depends on the feature type:
- **Numerical**: threshold-based splits (e.g., age > 29)
- **Categorical**: group-based splits (e.g., one group per education level)
- **Binary**: direct splits on the two values

In [None]:
def splitting(X, indices, feature):
    ...
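
One way these split types could be expressed is sketched below. The helper `split_indices` and its `threshold` parameter are assumptions for illustration and are not the notebook's `splitting(X, indices, feature)`; here a threshold selects the numerical case, and its absence groups rows by value, which covers both categorical and binary features.

```python
import pandas as pd

def split_indices(X, indices, feature, threshold=None):
    """Partition the row labels `indices` of X by one feature (hypothetical helper).

    With a threshold: two children, `<= threshold` and `> threshold`.
    Without one: one child per distinct value of the feature.
    """
    column = X.loc[indices, feature]
    if threshold is not None:
        return {
            f"<= {threshold}": column[column <= threshold].index.tolist(),
            f"> {threshold}": column[column > threshold].index.tolist(),
        }
    return {value: group.index.tolist() for value, group in column.groupby(column)}

# Made-up rows; education is assumed to be integer-encoded already.
X_toy = pd.DataFrame({"age": [25, 35, 29, 41], "education": [0, 1, 1, 2]})
all_rows = X_toy.index.tolist()

by_age = split_indices(X_toy, all_rows, "age", threshold=29)
by_education = split_indices(X_toy, all_rows, "education")
```

Returning row labels rather than sub-frames keeps the split cheap and lets the same `X` be reused at every level of the tree.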

## 3. Data Exploration & Feature Engineering

### Feature-Target Relationships
Visualizing relationships between key features and loan status

In [None]:
# Enhanced feature visualization
plt.figure(figsize=(15, 5))

# Age vs Loan Status
plt.subplot(1, 3, 1)
sns.boxplot(x=y, y=X['age'])
plt.title('Age Distribution by Loan Status')
plt.xticks([0, 1], ['Denied', 'Approved'])

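# Income vs Loan Status
plt.subplot(1, 3, 2)
sns.violinplot(x=y, y=X['income'])
plt.title('Income Distribution by Loan Status')
plt.xticks([0, 1], ['Denied', 'Approved'])  # same tick labels as the age panel

# Credit Score vs Loan Status
plt.subplot(1, 3, 3)
# common_norm=False normalizes each class separately so the two shapes are comparable
sns.histplot(data=X, x='credit_score', hue=y, element='step', stat='density',
             common_norm=False)
plt.title('Credit Score Distribution by Loan Status')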

plt.tight_layout()
plt.show()

## 4. Tree Construction & Training

### Tree Building Process
The tree is built recursively. A branch stops growing and becomes a leaf when:
- the maximum depth is reached
- the node is pure (all of its samples belong to one class)

In [None]:
def build_tree_recursive(...):
    ...
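
The recursion and its two stopping rules can be sketched as below. This is a simplified illustration, not the notebook's `build_tree_recursive`: it assumes numeric features in a NumPy array, uses only binary `<=` threshold splits, stores the tree as nested dicts, and labels a leaf with the majority class. All function names and the toy data are hypothetical.

```python
import numpy as np

def entropy_bits(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def majority_class(labels):
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)].item()

def best_threshold_split(X, y):
    """Return (gain, feature_index, threshold) of the best `<=` split, or None."""
    best = None
    parent = entropy_bits(y)
    for j in range(X.shape[1]):
        # Skip the largest value: splitting there would leave the right side empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            child = (left.sum() * entropy_bits(y[left])
                     + (~left).sum() * entropy_bits(y[~left])) / y.size
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, j, t.item())
    return best

def build_tree_sketch(X, y, depth=0, max_depth=3):
    # Stopping rules: pure node, or depth limit reached.
    if np.unique(y).size == 1 or depth >= max_depth:
        return {"leaf": majority_class(y)}
    split = best_threshold_split(X, y)
    if split is None or split[0] <= 0:  # no candidate split reduces entropy
        return {"leaf": majority_class(y)}
    _, j, t = split
    left = X[:, j] <= t
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree_sketch(X[left], y[left], depth + 1, max_depth),
        "right": build_tree_sketch(X[~left], y[~left], depth + 1, max_depth),
    }

def predict_one(node, row):
    # Walk from the root to a leaf, following the threshold tests.
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

# Made-up rows: columns are credit_score and income; labels are 0 = Denied, 1 = Approved.
X_toy = np.array([[600, 30], [720, 45], [580, 60], [750, 25]])
y_toy = np.array([0, 1, 0, 1])
toy_tree = build_tree_sketch(X_toy, y_toy, max_depth=2)
```

On this toy data a single split on the first column at 600 separates the classes, so both children are pure and the recursion ends at depth 1. The extra zero-gain stop is a design choice of this sketch; it avoids pointless splits but also prevents the tree from learning patterns where no single split helps on its own.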

## 5. Model Evaluation & Visualization

### Class Distribution Analysis
Comparing class balance in training and test sets

In [None]:
# Class distribution visualization
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

train_counts = y_train.value_counts().sort_index()
test_counts = y_test.value_counts().sort_index()

ax[0].pie(train_counts, labels=['Denied', 'Approved'], autopct='%1.1f%%')
ax[0].set_title('Training Set Class Distribution')

ax[1].pie(test_counts, labels=['Denied', 'Approved'], autopct='%1.1f%%')
ax[1].set_title('Test Set Class Distribution')

plt.show()

### Decision Surface Visualization
2D decision boundaries over two features, `credit_score` and `income`

In [None]:
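# Decision boundary plot
def plot_decision_boundary(X, y, tree):
    # Plot over two hand-picked features
    features = ['credit_score', 'income']
    X_plot = X[features]

    # Create mesh grid; linspace gives a fixed 200x200 resolution on any feature scale
    x_min, x_max = X_plot.iloc[:, 0].min() - 1, X_plot.iloc[:, 0].max() + 1
    y_min, y_max = X_plot.iloc[:, 1].min() - 1, X_plot.iloc[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))

    # Predict for grid points. The tree indexes features by their position in the
    # full feature table, so the remaining columns are held at their median
    # (assumes every column is numeric) and the original column order is restored.
    grid_points = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=features)
    for col in X.columns.difference(features):
        grid_points[col] = X[col].median()
    grid_points = grid_points[X.columns]
    Z = predict(grid_points, tree, y_train)
    Z = np.asarray(Z).reshape(xx.shape)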

    # Plot decision surface
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')
    plt.scatter(X_plot.iloc[:, 0], X_plot.iloc[:, 1], c=y, 
                edgecolor='k', cmap='coolwarm')
    plt.xlabel('Credit Score')
    plt.ylabel('Income')
    plt.title('Decision Boundary Visualization')
    plt.show()

plot_decision_boundary(X_test, y_test, tree)

### Feature Importance Analysis
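Feature importance is approximated by how often each feature is chosen for a split. This count is a rough proxy: it ignores how much information each split gains and how many samples it affects.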

In [None]:
# Feature importance visualization
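def get_feature_importance(tree):
    # Start every feature at zero so unused features still appear in the plot
    importance = {feature: 0 for feature in X.columns}
    for node in tree:
        # As the indexing below implies, 3-element nodes are treated as splits
        # whose last entry is the column index of the feature used
        if len(node) == 3:
            importance[X.columns[node[2]]] += 1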
    
    plt.figure(figsize=(10, 5))
    pd.Series(importance).sort_values().plot(kind='barh')
    plt.title('Feature Importance Based on Split Frequency')
    plt.xlabel('Number of Splits')
    plt.ylabel('Features')
    plt.show()

get_feature_importance(tree)