# Case Study: Predicting Customer Purchase Behavior using Decision Tree with Grid Search

In this case study, we use a Decision Tree classifier to predict customer purchases based on features like age, income, browsing time, and product category. We will then use Grid Search to tune hyperparameters and improve the model's performance.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

### Step 1: Data Generation

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate 5000 samples with random data for features
n_samples = 5000

# Age: Random integers between 18 and 65
age = np.random.randint(18, 66, size=n_samples)

# Income: Random integers between 30000 and 120000
income = np.random.randint(30000, 120001, size=n_samples)

# Browsing Time: Random float between 0 and 5 hours
browsing_time = np.random.uniform(0, 5, size=n_samples)

# Product Category: Randomly chosen from ['Electronics', 'Clothing', 'Groceries', 'Toys']
product_category = np.random.choice(['Electronics', 'Clothing', 'Groceries', 'Toys'], size=n_samples)

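# Purchased: Target variable (0 = No, 1 = Yes).
# NOTE: the labels are drawn at random and independently of the features, so no model
# can be expected to beat the ~70% majority-class baseline on this synthetic data.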
purchased = np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3])  # 30% purchase rate

# Creating the DataFrame
df = pd.DataFrame({
    'Age': age,
    'Income': income,
    'Browsing Time': browsing_time,
    'Product Category': product_category,
    'Purchased': purchased
})
df.head()
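
The target above is sampled independently of every feature. If the intent is a target that actually depends on income and browsing time, one possible construction is sketched below. The logistic form, the weights (`0.8`, `1.2`), the intercept, and the name `purchased_rule` are illustrative assumptions, not part of the original dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 5000

income = rng.integers(30000, 120001, size=n_samples)
browsing_time = rng.uniform(0, 5, size=n_samples)

# Standardize both drivers so the (assumed) weights are on a comparable scale
income_z = (income - income.mean()) / income.std()
browsing_z = (browsing_time - browsing_time.mean()) / browsing_time.std()

# Hypothetical rule: higher income and longer browsing raise the purchase probability
logit = -1.0 + 0.8 * income_z + 1.2 * browsing_z
purchase_prob = 1.0 / (1.0 + np.exp(-logit))

# Draw each label from its own Bernoulli probability
purchased_rule = rng.binomial(1, purchase_prob)

print(f"Purchase rate: {purchased_rule.mean():.3f}")
print(f"Correlation with browsing time: {np.corrcoef(browsing_time, purchased_rule)[0, 1]:.3f}")
```

With a target built this way there is a real pattern for the tree to learn, so differences between the untuned and tuned models become meaningful.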

### Feature Encoding and Train-Test Split

In [None]:
# Select the features and the target from the DataFrame generated above
X = df[['Age', 'Income', 'Browsing Time', 'Product Category']]
y = df['Purchased']

# One-hot encode 'Product Category' since it's categorical
X = pd.get_dummies(X, drop_first=True)  # This converts 'Product Category' into numerical columns

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
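
Because roughly 30% of the labels are positive, a plain random split can leave the train and test sets with slightly different class ratios. A minimal sketch of a stratified split follows; it uses small stand-in data (`X_demo`, `y_demo` are assumed names) rather than the notebook's `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in data with the same 70/30 class balance as the case study
X_demo = pd.DataFrame({'Income': rng.integers(30000, 120001, size=1000)})
y_demo = pd.Series(rng.choice([0, 1], size=1000, p=[0.7, 0.3]), name='Purchased')

# stratify=y_demo keeps the class proportions (nearly) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```

In the notebook itself, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.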

### Feature Standardization

In [None]:
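# Fit the scaler on the training set only, then reuse it on the test set to avoid leakage.
# Decision trees split on per-feature thresholds, so scaling is not required for them;
# it is kept here so the same pipeline can be reused with scale-sensitive models.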
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
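
Standardization is a monotonic transformation of each feature, so in principle it should not change which splits a decision tree chooses. The sketch below checks this on stand-in data; the data, the `max_depth=4` setting, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Stand-in features on very different scales: age, income, browsing time
X_demo = np.column_stack([
    rng.integers(18, 66, 1000),
    rng.integers(30000, 120001, 1000),
    rng.uniform(0, 5, 1000),
]).astype(float)
# Assumed target that depends on browsing time plus noise
y_demo = (X_demo[:, 2] + rng.normal(0, 1, 1000) > 3).astype(int)

X_tr, X_te = X_demo[:800], X_demo[800:]
y_tr = y_demo[:800]

scaler = StandardScaler().fit(X_tr)

# Same tree settings, once on raw features and once on standardized features
raw_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)

agreement = np.mean(
    raw_tree.predict(X_te) == scaled_tree.predict(scaler.transform(X_te))
)
print(f"Share of identical test predictions: {agreement:.3f}")
```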

### Step 2: Initial Decision Tree Model

Next, let's train a simple decision tree model and evaluate its performance without hyperparameter tuning.

In [None]:
# Initialize a Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Train the model
dt.fit(X_train_scaled, y_train)

# Make predictions
y_pred = dt.predict(X_test_scaled)

# Evaluate accuracy
print(f'Accuracy without tuning: {accuracy_score(y_test, y_pred):.4f}')
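
Accuracy on its own is hard to interpret when about 70% of the labels belong to one class. A useful reference point is a majority-class baseline, alongside the tree's training accuracy as an overfitting check. The sketch below uses stand-in data with random labels; `X_demo`, `y_demo`, and the other names are assumptions rather than the notebook's variables.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Stand-in data: continuous features, labels drawn at random (70/30)
X_demo = rng.uniform(0, 5, size=(1000, 3))
y_demo = rng.choice([0, 1], size=1000, p=[0.7, 0.3])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
# Unconstrained tree, as in the untuned model above
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tree_train_acc = accuracy_score(y_tr, tree.predict(X_tr))
tree_test_acc = accuracy_score(y_te, tree.predict(X_te))

print(f"Baseline test accuracy: {baseline_acc:.4f}")
print(f"Tree train accuracy:    {tree_train_acc:.4f}")
print(f"Tree test accuracy:     {tree_test_acc:.4f}")
```

A large gap between training and test accuracy indicates that the unconstrained tree has memorized the training set.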

### Step 3: Hyperparameter Tuning with Grid Search

Now, let's perform hyperparameter tuning using Grid Search to find the best parameters for the Decision Tree model.

In [None]:
# Define parameter grid for Decision Tree
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 7, 10],
    'min_samples_split': [5, 10, 15],
    'min_samples_leaf': [1, 3, 5],
    'max_features': [None, 'sqrt', 'log2']
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)

# Fit the model with Grid Search
grid_search.fit(X_train_scaled, y_train)

# Best parameters and score from Grid Search
print('Best Parameters: ', grid_search.best_params_)
print('Best Cross-Validation Score: ', grid_search.best_score_)
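
Before running an exhaustive search it helps to know how many models will be trained. The sketch below counts the combinations in the grid defined above using `ParameterGrid`; the names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 7, 10],
    'min_samples_split': [5, 10, 15],
    'min_samples_leaf': [1, 3, 5],
    'max_features': [None, 'sqrt', 'log2']
}

# Number of distinct hyperparameter combinations: 2 * 4 * 3 * 3 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per cross-validation fold (cv=5)
n_fits = n_candidates * 5

print(f"Candidates: {n_candidates}")   # 216
print(f"Cross-validation fits: {n_fits}")  # 1080
```

With the default `refit=True`, `GridSearchCV` trains one additional model on the full training set using the best combination.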

### Step 4: Evaluate the Best Model

Once the best hyperparameters are found, we use the best model to make predictions and evaluate its performance on the test set.

In [None]:
# Get the best model from grid search
best_dt = grid_search.best_estimator_

# Make predictions using the best model
y_pred_best = best_dt.predict(X_test_scaled)

# Evaluate accuracy of the best model
print(f'Accuracy with Grid Search tuning: {accuracy_score(y_test, y_pred_best):.4f}')
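
With an imbalanced target, a model can reach high accuracy while rarely predicting the minority class. A confusion matrix and per-class precision and recall make this visible. The sketch below is a stand-alone illustration on stand-in data; the tree settings and variable names are assumptions, not the tuned model from the grid search.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Stand-in data with a 70/30 class balance
X_demo = rng.uniform(0, 5, size=(1000, 3))
y_demo = rng.choice([0, 1], size=1000, p=[0.7, 0.3])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Assumed stand-in for a tuned tree
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
print(cm)

# zero_division=0 avoids warnings if a class is never predicted
print(classification_report(y_te, y_hat, labels=[0, 1], zero_division=0))
```

In the notebook, the same two calls could be applied to `y_test` and `y_pred_best`.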

### Step 5: Model Comparison

Finally, we compare the accuracy of the initial model (without tuning) and the tuned model (with Grid Search).

In [None]:
# Comparison of accuracy
initial_accuracy = accuracy_score(y_test, y_pred)
tuned_accuracy = accuracy_score(y_test, y_pred_best)

print(f'Accuracy without tuning: {initial_accuracy:.4f}')
print(f'Accuracy with tuning (Grid Search): {tuned_accuracy:.4f}')
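
A comparison on a single test split can be noisy. One way to make it more robust is to score both models with cross-validation and compare the mean and spread. This is a minimal sketch on stand-in data; the constrained tree's settings are hypothetical and stand in for whatever the grid search selects.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Stand-in data with a 70/30 class balance
X_demo = rng.uniform(0, 5, size=(1000, 3))
y_demo = rng.choice([0, 1], size=1000, p=[0.7, 0.3])

default_tree = DecisionTreeClassifier(random_state=42)
# Hypothetical "tuned" settings, used only for illustration
constrained_tree = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=5, random_state=42
)

# 5-fold cross-validated accuracy for each model
default_scores = cross_val_score(default_tree, X_demo, y_demo, cv=5)
constrained_scores = cross_val_score(constrained_tree, X_demo, y_demo, cv=5)

print(f"Default tree:     {default_scores.mean():.4f} +/- {default_scores.std():.4f}")
print(f"Constrained tree: {constrained_scores.mean():.4f} +/- {constrained_scores.std():.4f}")
```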

### Conclusion

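Grid Search provides a systematic way to tune hyperparameters and often yields a model that generalizes better than the defaults, although an improvement is not guaranteed. Constraining the Decision Tree with `max_depth`, `min_samples_split`, and `min_samples_leaf` limits how closely it can fit the training data, which reduces overfitting; setting these values too restrictively causes underfitting instead. Note that the `Purchased` labels in this synthetic dataset are drawn independently of the features, so any gain from tuning here reflects less overfitting to noise rather than a learned purchase pattern.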
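
The effect of tree depth on overfitting and underfitting can be seen by sweeping `max_depth` and comparing training and test accuracy. The sketch below uses `make_classification` as a stand-in dataset with real signal; the dataset parameters and the list of depths are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset with informative features and 10% label noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=2,
    flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each depth
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={str(depth):>4}  train={results[depth][0]:.3f}  test={results[depth][1]:.3f}")
```

Very shallow trees tend to score low on both sets (underfitting), while an unlimited depth fits the training set perfectly and usually scores lower on the test set (overfitting).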

### Additional Notes

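- **GridSearchCV** performs an exhaustive search over the specified hyperparameters and uses cross-validation to evaluate each combination. The grid above contains 2 x 4 x 3 x 3 x 3 = 216 combinations, so 5-fold cross-validation trains 1080 models.
- If the search space is too large for an exhaustive search, **RandomizedSearchCV** evaluates a fixed number of randomly sampled combinations instead, which caps the cost.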
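
As a rough illustration of the second note, the sketch below runs `RandomizedSearchCV` over a search space similar to the grid above. The integer ranges, `n_iter=20`, and the stand-in dataset are assumptions made for this example.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (not the case-study data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=2, random_state=42
)

# Lists are sampled uniformly; randint(a, b) samples integers in [a, b)
param_distributions = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 7, 10],
    'min_samples_split': randint(2, 21),
    'min_samples_leaf': randint(1, 11),
    'max_features': [None, 'sqrt', 'log2']
}

# Evaluate only 20 sampled combinations with 5-fold cross-validation
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=42
)
search.fit(X_demo, y_demo)

print('Best Parameters: ', search.best_params_)
print('Best Cross-Validation Score: ', search.best_score_)
```

Here the cost is fixed at 20 x 5 = 100 fits regardless of how large the search space is.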