# Bank Marketing Decision Tree Classifier

This notebook builds a decision tree classifier that predicts whether a client will subscribe to a term deposit, using the features in the Bank Marketing dataset.

## 1. Setup and Data Loading

First, let's import the necessary libraries and load our dataset.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set random seed for reproducibility
np.random.seed(42)

# Set style for visualizations
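# (the 'seaborn' Matplotlib style name was deprecated in Matplotlib 3.6 and later removed,
# so use seaborn's own theming instead)
sns.set_theme(style='darkgrid', palette='Set2')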

In [None]:
# Load the dataset
df = pd.read_csv('../data/bank-marketing.csv', sep=';')

# Display basic information about the dataset
print("Dataset Info:")
df.info()

# Display first few rows
print("\nFirst few rows:")
df.head()

## 2. Data Preprocessing

Let's prepare our data for the decision tree classifier by handling categorical variables and splitting the data.

In [None]:
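# Function to encode categorical variables
def encode_categorical(df):
    """Label-encode every object-dtype column of df in place and return it."""
    categorical_cols = df.select_dtypes(include=['object']).columns

    for col in categorical_cols:
        # Fit a fresh encoder per column; classes are sorted, so the
        # target 'y' maps 'no' -> 0 and 'yes' -> 1
        df[col] = LabelEncoder().fit_transform(df[col])

    return df

# Encode categorical variables (work on a copy so the raw df stays intact for EDA)
df_encoded = encode_categorical(df.copy())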

# Split features and target
X = df_encoded.drop('y', axis=1)
y = df_encoded['y']

# Split data into training and testing sets
# (stratify keeps the yes/no ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
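
Label encoding assigns each category an arbitrary integer, so the tree may split on an ordering that has no real meaning for nominal columns. One common alternative is one-hot encoding. The sketch below is illustrative only: the toy frame, its column names and values, and the helper `one_hot_features` are assumptions, not part of this notebook's data or pipeline.

```python
import pandas as pd

# Toy frame standing in for the real data; columns and values are hypothetical
toy = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "job": ["admin.", "technician", "admin.", "services"],
    "y": ["no", "yes", "no", "no"],
})

def one_hot_features(frame, target="y"):
    """Return one-hot encoded features and a 0/1 target (hypothetical helper)."""
    # get_dummies expands each categorical column into one indicator column per category
    features = pd.get_dummies(frame.drop(columns=[target]), dtype=int)
    # Map the target explicitly so the positive class is unambiguous
    labels = (frame[target] == "yes").astype(int)
    return features, labels

toy_X, toy_y = one_hot_features(toy)
print(toy_X.columns.tolist())
print(toy_y.tolist())
```

For a single decision tree, either encoding can work; one-hot encoding mainly makes each split easier to interpret, at the cost of more columns.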

## 3. Exploratory Data Analysis

Let's look at how the target classes are distributed. If one class dominates, accuracy alone can be misleading.

In [None]:
# Plot target distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='y')
plt.title('Distribution of Target Variable (Term Deposit Subscription)')
plt.show()

# Calculate target distribution percentages
target_dist = df['y'].value_counts(normalize=True) * 100
print("\nTarget Distribution:")
for label, percentage in target_dist.items():
    print(f"{label}: {percentage:.2f}%")
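
The class percentages above also give a reference point for the model: a classifier that always predicts the most frequent class already reaches that class's share as its accuracy. A minimal sketch of this baseline, using made-up labels and a hypothetical helper `majority_baseline_accuracy` rather than the real target column:

```python
import pandas as pd

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return pd.Series(labels).value_counts(normalize=True).max()

# Made-up labels with a 90/10 split, for illustration only
toy_labels = ["no"] * 9 + ["yes"] * 1
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the real data, the same call on `df['y']` would give the accuracy that the decision tree needs to beat to be useful.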

## 4. Building the Decision Tree Classifier

Now let's create and train our decision tree model.

In [None]:
# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42, max_depth=5)
dt_classifier.fit(X_train, y_train)

# Make predictions
y_pred = dt_classifier.predict(X_test)

# Print model performance
print("Model Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
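
The fixed `max_depth=5` is one choice among many. One way to compare depths is cross-validation on the training data. The sketch below uses synthetic data from `make_classification` as a stand-in, and the candidate depths and the F1 scoring choice are assumptions, not tuned values for the Bank Marketing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the Bank Marketing dataset)
demo_X, demo_y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Mean 5-fold cross-validated F1 score for each candidate depth
depth_scores = {}
for depth in (2, 3, 5, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, demo_X, demo_y, cv=5, scoring="f1")
    depth_scores[depth] = scores.mean()

best_depth = max(depth_scores, key=depth_scores.get)
for depth, score in depth_scores.items():
    print(f"max_depth={depth}: mean F1 = {score:.3f}")
print("Best depth on this synthetic data:", best_depth)
```

In the notebook itself, `X_train` and `y_train` would take the place of the synthetic arrays, so that the test set stays untouched during tuning.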

## 5. Model Visualization and Feature Importance

In [None]:
# Plot decision tree
plt.figure(figsize=(20,10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(dt_classifier, feature_names=list(X.columns), class_names=['no', 'yes'],
          filled=True, rounded=True)
plt.title('Decision Tree Visualization')
plt.show()

# Plot feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': dt_classifier.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(12,6))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Most Important Features')
plt.show()
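
The `feature_importances_` values are computed from impurity reductions on the training data. A complementary check is permutation importance on held-out data, which measures how much the score drops when one feature's values are shuffled. This is a sketch on synthetic data; the dataset shape and the `n_repeats` value are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Bank Marketing dataset)
demo_X, demo_y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, stratify=demo_y, random_state=0
)

demo_tree = DecisionTreeClassifier(max_depth=5, random_state=0)
demo_tree.fit(demo_X_train, demo_y_train)

# Shuffle each feature 10 times on the held-out split and record the score drop
result = permutation_importance(
    demo_tree, demo_X_test, demo_y_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

If the two rankings disagree strongly on the real data, that is a hint that the impurity-based values reflect the training set more than general predictive value.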

## 6. Model Evaluation

In [None]:
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8,6))
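# Label the axes with class names (0 = 'no', 1 = 'yes' after label encoding)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['no', 'yes'], yticklabels=['no', 'yes'])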
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Calculate and print additional metrics for the positive class ('yes', encoded as 1)
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# Named f1 rather than f1_score to avoid shadowing sklearn.metrics.f1_score
f1 = 2 * (precision * recall) / (precision + recall)

print("\nAdditional Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
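
The manual formulas can be cross-checked against scikit-learn's own metric functions, which also handle the case where no positives are predicted. A small sketch with made-up labels (the `demo_*` arrays are illustrative, not model output):

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up labels for illustration: 1 is the positive class
demo_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
demo_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Manual computation from the confusion matrix, as in the cell above
demo_tn, demo_fp, demo_fn, demo_tp = confusion_matrix(demo_true, demo_pred).ravel()
manual_precision = demo_tp / (demo_tp + demo_fp)
manual_recall = demo_tp / (demo_tp + demo_fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# Library computation; zero_division=0 returns 0 instead of warning when a denominator is 0
lib_precision, lib_recall, lib_f1, _ = precision_recall_fscore_support(
    demo_true, demo_pred, average="binary", zero_division=0
)

print(f"Precision: manual={manual_precision:.4f}, sklearn={lib_precision:.4f}")
print(f"Recall:    manual={manual_recall:.4f}, sklearn={lib_recall:.4f}")
print(f"F1 Score:  manual={manual_f1:.4f}, sklearn={lib_f1:.4f}")
```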

## 7. Conclusions and Recommendations

Based on our analysis:
1. Model Performance: [Will be filled after running]
2. Key Features: [Will be filled after running]
3. Areas for Improvement: [Will be filled after running]

Recommendations:
1. [Will be filled after running]
2. [Will be filled after running]
3. [Will be filled after running]