<a href="https://colab.research.google.com/github/swalehaparvin/kaggle_projects/blob/main/Decision_trees_vs_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Build a decision tree classifier that predicts income level from features such as age, education level, and hours worked per week, then extract the learned rules that explain its decisions. Finally, compare its performance with an MLPClassifier trained on the same data.

X_train, X_test, y_train, and y_test are pre-loaded for you. The accuracy_score and export_text functions are also imported for you.

Train the MLPClassifier model.
Derive the predictions on the test set.
Compute the model's test accuracy.
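
To make "rules that explain the decision" concrete for a single prediction, the sketch below traces the path one test row takes through a fitted tree. The data is a synthetic stand-in and the feature names are illustrative assumptions; `decision_path`, `apply`, and the `tree_` attributes are standard scikit-learn, but the loop is only one possible way to print a path.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): synthetic features with illustrative names
feature_names = ['age', 'income', 'education_years',
                 'experience', 'hours_worked', 'location_score']
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=1,
    n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42, max_depth=3).fit(X_train, y_train)

sample = X_test[:1]                             # first test row, kept 2-D
node_ids = tree.decision_path(sample).indices   # nodes visited, root to leaf
leaf_id = tree.apply(sample)[0]                 # id of the leaf the row lands in

for node_id in node_ids:
    if node_id == leaf_id:
        # The leaf predicts the class with the largest stored value
        predicted = tree.classes_[tree.tree_.value[node_id].argmax()]
        print(f"leaf {node_id}: class {predicted}")
    else:
        name = feature_names[tree.tree_.feature[node_id]]
        threshold = tree.tree_.threshold[node_id]
        # The row went left (<=) if the left child is on its path
        went_left = tree.tree_.children_left[node_id] in node_ids
        op = "<=" if went_left else ">"
        print(f"node {node_id}: {name} {op} {threshold:.2f}")
```

Each printed line corresponds to one branch of the `export_text` output, so the path can be checked against the full rule listing.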

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Create a sample dataset
X, y = make_classification(
    n_samples=1000,           # Number of samples
    n_features=6,             # Number of features
    n_informative=4,          # Number of informative features
    n_redundant=1,            # Number of redundant features
    n_clusters_per_class=1,   # Number of clusters per class
    random_state=42
)

# Convert to DataFrame for better visualization
feature_names = ['age', 'income', 'education_years', 'experience', 'hours_worked', 'location_score']
X = pd.DataFrame(X, columns=feature_names)

# Scale and shift each feature toward a more interpretable magnitude.
# Values are not clipped, so some fall outside a plausible range
# (e.g. negative hours or experience, as the sample output shows).
X['age'] = (X['age'] * 10 + 35).astype(int)  # Scale by 10, offset 35
X['income'] = (X['income'] * 20000 + 50000).astype(int)  # Scale by 20k, offset 50k
X['education_years'] = (X['education_years'] * 5 + 16).astype(int)  # Scale by 5, offset 16
X['experience'] = (X['experience'] * 8 + 5).astype(int)  # Scale by 8, offset 5
X['hours_worked'] = (X['hours_worked'] * 15 + 40).astype(int)  # Scale by 15, offset 40
X['location_score'] = (X['location_score'] * 50 + 50).astype(int)  # Scale by 50, offset 50

print("Sample of the dataset:")
print(X.head())
print(f"\nTarget distribution: {np.bincount(y)}")

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # For reproducible results
    stratify=y        # Maintain class distribution
)

print(f"\nTraining set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

# Decision Tree Classifier
print("\n" + "="*50)
print("DECISION TREE CLASSIFIER")
print("="*50)

model = DecisionTreeClassifier(random_state=42, max_depth=3)
model.fit(X_train, y_train)

# Extract the rules
rules = export_text(model, feature_names=list(X_train.columns))
print("Decision Tree Rules:")
print(rules)

# Make predictions
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.3f}")

# Neural Network (MLP) Classifier
print("\n" + "="*50)
print("MULTI-LAYER PERCEPTRON CLASSIFIER")
print("="*50)

model_mlp = MLPClassifier(
    hidden_layer_sizes=(36, 12),
    random_state=42,
    max_iter=1000  # Increase max iterations to avoid convergence warnings
)

# Train the MLPClassifier
model_mlp.fit(X_train, y_train)

# Derive the predictions on the test set
y_pred_mlp = model_mlp.predict(X_test)

# Compute the test accuracy
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
print(f"MLP Accuracy: {accuracy_mlp:.3f}")

# Compare both models
print("\n" + "="*50)
print("MODEL COMPARISON")
print("="*50)
print(f"Decision Tree Accuracy: {accuracy:.3f}")
print(f"MLP Classifier Accuracy: {accuracy_mlp:.3f}")

if accuracy > accuracy_mlp:
    print("Decision Tree performed better!")
elif accuracy_mlp > accuracy:
    print("MLP Classifier performed better!")
else:
    print("Both models performed equally well!")

# Feature importance for Decision Tree
print("\nFeature Importance (Decision Tree):")
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance)

Creating sample dataset...
Sample of the dataset:
   age  income  education_years  experience  hours_worked  location_score
0   32   36242                9          -4            54              67
1   31   79839                8           7            48              47
2   43   20735                1           2            62               4
3   28   27261                8          -4            26              96
4   17   48887               11           0           -10              69

Target distribution: [502 498]

Training set size: (800, 6)
Testing set size: (200, 6)

DECISION TREE CLASSIFIER
Decision Tree Rules:
|--- hours_worked <= 42.50
|   |--- income <= 71944.00
|   |   |--- hours_worked <= 38.50
|   |   |   |--- class: 0
|   |   |--- hours_worked >  38.50
|   |   |   |--- class: 0
|   |--- income >  71944.00
|   |   |--- hours_worked <= 2.50
|   |   |   |--- class: 0
|   |   |--- hours_worked >  2.50
|   |   |   |--- class: 1
|--- hours_worked >  42.50
|   |--- income <= 
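
The MLP in the cell above is trained on features with very different magnitudes (income in the tens of thousands, hours in the tens). Neural networks are generally sensitive to feature scale while decision trees are not, so the comparison may understate what the MLP can do. The sketch below wraps the same MLP configuration in a pipeline with `StandardScaler`; the data is a synthetic stand-in with one column rescaled to an income-like magnitude (an assumption for illustration), and no particular accuracy gain is claimed.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): same generator settings as the sample dataset
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=1,
    n_clusters_per_class=1, random_state=42,
)
X[:, 1] = X[:, 1] * 20000 + 50000  # give one column an income-like scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fit on the training data only, then applied before the MLP
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(36, 12), random_state=42, max_iter=1000),
)
scaled_mlp.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, scaled_mlp.predict(X_test))
print(f"Scaled MLP Accuracy: {scaled_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test set never influences the scaling statistics, which avoids leaking information into the evaluation.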

**Note:** The following cell contains placeholder code for loading and splitting data. Please replace it with your actual data loading and splitting logic.

In [3]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder code: Replace this with your actual data loading
# For example, if you have a CSV file named 'income_data.csv':
# df = pd.read_csv('income_data.csv')

# Assuming you have a DataFrame named 'df' with features and a target variable
# X = df.drop('income_level', axis=1)  # Replace 'income_level' with your target column name
# y = df['income_level']  # Replace 'income_level' with your target column name

# Placeholder code: Replace this with your actual data splitting
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Example placeholder data (replace with your actual data)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# Decision tree: kept shallow so the exported rules stay readable
model = DecisionTreeClassifier(random_state=42, max_depth=2)
model.fit(X_train, y_train)

# Extract the rules (the placeholder data has no column names, so use generic ones)
feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]
rules = export_text(model, feature_names=feature_names)
print(rules)

y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.2f}")

# Separate names for the MLP so the tree's model and results are not overwritten
model_mlp = MLPClassifier(hidden_layer_sizes=(36, 12), random_state=42)
# Train the MLPClassifier
model_mlp.fit(X_train, y_train)

# Derive the predictions on the test set
y_pred_mlp = model_mlp.predict(X_test)

# Compute the test accuracy
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
print(f"MLPClassifier Accuracy: {accuracy_mlp:.2f}")

|--- feature_5 <= -0.35
|   |--- feature_14 <= -1.81
|   |   |--- class: 1
|   |--- feature_14 >  -1.81
|   |   |--- class: 0
|--- feature_5 >  -0.35
|   |--- feature_18 <= -0.19
|   |   |--- class: 1
|   |--- feature_18 >  -0.19
|   |   |--- class: 1

Decision Tree Accuracy: 0.86
MLPClassifier Accuracy: 0.81
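
The two accuracies above come from a single 80/20 split, so a gap of a few points could be partly due to which rows happened to land in the test set. One way to check is k-fold cross-validation. The sketch below reuses the placeholder data and model settings from the cells above; `cv=5` and the raised `max_iter` are assumptions chosen for illustration, not part of the original exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Same placeholder data as the cell above
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, max_depth=2),
    # max_iter raised (assumption) to reduce convergence warnings
    "MLPClassifier": MLPClassifier(
        hidden_layer_sizes=(36, 12), random_state=42, max_iter=1000
    ),
}

cv_scores = {}
for name, estimator in models.items():
    # Each model is refit on 5 different train/validation partitions
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the standard deviations are comparable to the difference between the means, the single-split ranking should be treated with caution.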


