# Marketing Campaign Response Prediction

## Project Overview

This project develops a predictive model that identifies customers who are likely to respond to a marketing offer. Accurate predictions make campaigns more efficient, either by increasing the number of positive responses or by reducing contact costs.

## Dataset

The dataset contains customer information and behaviors, including:

- **Campaign Responses**: Acceptance of offers in previous campaigns (AcceptedCmp1-5).
- **Customer Complaints**: Whether a customer has complained in the last 2 years.
- **Customer Demographics**: Date of enrollment, education level, marital status, household composition, and yearly income.
- **Spending Patterns**: Amount spent on fish, meat, fruits, sweets, wine, and gold products over the last 2 years.
- **Purchase Methods**: Number of purchases made with discounts, via catalog, in stores, and through the website.
- **Online Behavior**: Number of visits to the company’s website in the last month.
- **Recency**: Number of days since the last purchase.

## Project Goal

The main goal is to train a predictive model that enables the company to maximize profits in the next marketing campaign by accurately targeting customers who are likely to respond positively to offers.
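
One way to make "maximize profits" concrete is to look at the expected profit of a single contact. The sketch below borrows the per-contact cost (3) and per-response revenue (11) that the modeling code in this notebook uses. The function name and the example probabilities are illustrative assumptions, not part of the original analysis.

```python
# Per-contact economics; the two values match those used in the modeling section.
COST_PER_CONTACT = 3
REVENUE_PER_RESPONSE = 11


def expected_profit_per_contact(p_response: float) -> float:
    """Expected profit of contacting one customer who responds with probability p_response."""
    return p_response * REVENUE_PER_RESPONSE - COST_PER_CONTACT


# A contact pays off only when p * revenue > cost, i.e. p > cost / revenue.
break_even_probability = COST_PER_CONTACT / REVENUE_PER_RESPONSE
print(f"Break-even response probability: {break_even_probability:.3f}")

# Illustrative probabilities only (not estimated from the data)
for p in (0.10, 0.15, 0.30, 0.60):
    print(f"p={p:.2f} -> expected profit per contact: {expected_profit_per_contact(p):+.2f}")
```

Under these numbers the break-even probability is 3/11, about 27%, which is well above the 14.91% overall response rate reported in the exploration section. That gap is why targeting matters: a contact is only worth making when the model places the customer above the break-even point.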


## Loading and Exploration

In [24]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Path to the Excel file
FILEPATH = 'marketing_campaign.xlsx'

# Load the data, assuming it is in the first sheet (sheet_name=0)
df = pd.read_excel(FILEPATH, sheet_name=0)

# Display basic information about the dataset (info() prints directly and returns None)
df.info()
print("\nFirst few rows of the dataset:")
print(df.head())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [25]:
# Display summary statistics
print("\nSummary statistics:")
print(df.describe())

# Calculate the response rate
response_rate = df['Response'].mean()
print(f"\nOverall response rate: {response_rate:.2%}")

Summary statistics:
                 ID   Year_Birth         Income      Kidhome     Teenhome  \
count   2240.000000  2240.000000    2216.000000  2240.000000  2240.000000   
mean    5592.159821  1968.805804   52247.251354     0.444196     0.506250   
std     3246.662198    11.984069   25173.076661     0.538398     0.544538   
min        0.000000  1893.000000    1730.000000     0.000000     0.000000   
25%     2828.250000  1959.000000   35303.000000     0.000000     0.000000   
50%     5458.500000  1970.000000   51381.500000     0.000000     0.000000   
75%     8427.750000  1977.000000   68522.000000     1.000000     1.000000   
max    11191.000000  1996.000000  666666.000000     2.000000     2.000000   

           Recency     MntWines    MntFruits  MntMeatProducts  \
count  2240.000000  2240.000000  2240.000000      2240.000000   
mean     49.109375   303.935714    26.302232       166.950000   
std      28.962453   336.597393    39.773434       225.715373   
min       0.000000     0.0

- The dataset contains 2,240 entries and 29 columns.
- The "Income" column has 24 missing values (2,216 non-null out of 2,240 rows).
- The overall response rate for the marketing campaign is approximately 14.91%, so responders are a clear minority class.
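
The checks behind these observations can be reproduced on any DataFrame. The sketch below uses a tiny made-up frame to count missing values per column and measure class balance. The values are invented for illustration; only the `Income` and `Response` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; the values are invented for illustration.
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0, np.nan, 62513.0],
    "Response": [1, 0, 0, 0, 0, 1],
})

# Missing values per column, keeping only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Class balance: share of each target value
class_share = toy["Response"].value_counts(normalize=True)
print(class_share)

# The mean of a 0/1 target equals the positive-class share
response_rate = toy["Response"].mean()
print(f"Response rate: {response_rate:.2%}")
```

With a positive class of roughly 15%, accuracy alone can look good for a model that rarely predicts a response, so the per-class precision and recall in the classification reports are worth reading closely.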

## Handling Missing Values and Feature Engineering

In [30]:
# Handle missing values by filling them with the median income
median_income = df['Income'].median()
df['Income'] = df['Income'].fillna(median_income)  # Avoid inplace modification

# Convert categorical variables to numerical using one-hot encoding
categorical_cols = ['Education', 'Marital_Status']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Feature engineering: create new features
# Total number of children
df['Total_Children'] = df['Kidhome'] + df['Teenhome']

# Total amount spent on all products
df['Total_Spent'] = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] +
                     df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])

# Drop columns that are not needed for the model
columns_to_drop = ['ID', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue']
df = df.drop(columns=columns_to_drop)

# Display the first few rows of the modified dataset
print("Modified dataset:")
print(df.head())

Modified dataset:
   Year_Birth   Income  Kidhome  Teenhome  Recency  MntWines  MntFruits  \
0        1957  58138.0        0         0       58       635         88   
1        1954  46344.0        1         1       38        11          1   
2        1965  71613.0        0         0       26       426         49   
3        1984  26646.0        1         0       26        11          4   
4        1981  58293.0        1         0       94       173         43   

   MntMeatProducts  MntFishProducts  MntSweetProducts  ...  Education_PhD  \
0              546              172                88  ...          False   
1                6                2                 1  ...          False   
2              127              111                21  ...          False   
3               20               10                 3  ...          False   
4              118               46                27  ...           True   

   Marital_Status_Alone  Marital_Status_Divorced  Marital_Status_Mar
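
To see what the median fill and `drop_first=True` do, the sketch below applies the same two calls to a small invented frame. The rows and category values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Income": [30000.0, np.nan, 50000.0, 90000.0],
    "Education": ["Graduation", "PhD", "Master", "PhD"],
})

# median() skips NaN, so it is computed from the three observed incomes
median_income = toy["Income"].median()
toy["Income"] = toy["Income"].fillna(median_income)

# drop_first=True drops the first category in sorted order ("Graduation" here);
# that category becomes the implicit baseline when all remaining dummies are False/0.
encoded = pd.get_dummies(toy, columns=["Education"], drop_first=True)
print(encoded)
```

The median is less sensitive than the mean to extreme values such as the 666,666 maximum income visible in the summary statistics, which makes it a reasonable default for this fill.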

## Modeling and Evaluation

In [33]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Separate features and target
X = df.drop('Response', axis=1)
y = df['Response']

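# Feature selection
# Caveat: the selector is fit on all rows before the train/test split, so test rows
# influence which features are kept; fitting it on the training split alone avoids this.
selector = SelectKBest(f_classif, k=15)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()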

print("Selected features:")
print(selected_features)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize models
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
lr_model = LogisticRegression(random_state=42)

models = [rf_model, gb_model, lr_model]
model_names = ['Random Forest', 'Gradient Boosting', 'Logistic Regression']

# Campaign economics used for the profit estimate
cost_per_contact = 3
revenue_per_response = 11

def calculate_profit(y_true, y_pred):
    """Revenue from contacted responders minus the cost of every contact."""
    true_positives = ((y_true == 1) & (y_pred == 1)).sum()
    false_positives = ((y_true == 0) & (y_pred == 1)).sum()

    revenue = true_positives * revenue_per_response
    cost = (true_positives + false_positives) * cost_per_contact
    return revenue - cost

# Train and evaluate models
for model, name in zip(models, model_names):
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    print(f"\n{name} Classification Report:")
    print(classification_report(y_test, y_pred))

    # Feature importance: tree ensembles expose feature_importances_;
    # for logistic regression, use the absolute coefficients instead
    if hasattr(model, 'feature_importances_'):
        feature_importance = model.feature_importances_
    else:
        feature_importance = np.abs(model.coef_[0])

    feature_importance_dict = dict(zip(selected_features, feature_importance))
    sorted_importance = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

    print(f"\n{name} Feature Importance:")
    for feature, importance in sorted_importance[:5]:
        print(f"{feature}: {importance:.4f}")

    # Visualize feature importance
    plot_path = f'{name.lower().replace(" ", "_")}_feature_importance.png'
    plt.figure(figsize=(10, 6))
    sns.barplot(x=[imp for _, imp in sorted_importance], y=[feat for feat, _ in sorted_importance])
    plt.title(f"{name} Feature Importance")
    plt.xlabel("Importance")
    plt.ylabel("Features")
    plt.tight_layout()
    plt.savefig(plot_path)
    plt.close()

    print(f"\n{name} feature importance plot saved as '{plot_path}'")

    # Estimate profit on the test set
    test_profit = calculate_profit(y_test, y_pred)
    print(f"\n{name} estimated profit on test set: ${test_profit:.2f}")

# Calculate profit if we contact everyone
everyone_profit = calculate_profit(y_test, np.ones_like(y_test))
print(f"\nProfit if we contact everyone: ${everyone_profit:.2f}")

# Calculate profit if we contact no one
no_one_profit = calculate_profit(y_test, np.zeros_like(y_test))
print(f"Profit if we contact no one: ${no_one_profit:.2f}")

# Compare model performances
model_profits = [calculate_profit(y_test, model.predict(X_test_scaled)) for model in models]
best_model_index = np.argmax(model_profits)
best_model_name = model_names[best_model_index]
best_model_profit = model_profits[best_model_index]

print(f"\nBest performing model: {best_model_name}")
print(f"Best model profit: ${best_model_profit:.2f}")
print(f"Profit improvement over contacting everyone: {(best_model_profit - everyone_profit) / abs(everyone_profit) * 100:.2f}%")

Selected features:
['Income', 'Teenhome', 'Recency', 'MntWines', 'MntMeatProducts', 'MntGoldProds', 'NumWebPurchases', 'NumCatalogPurchases', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Total_Children', 'Total_Spent']
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.97      0.92       379
           1       0.61      0.28      0.38        69

    accuracy                           0.86       448
   macro avg       0.75      0.62      0.65       448
weighted avg       0.84      0.86      0.84       448

Random Forest Feature Importance:
Recency: 0.1367
Total_Spent: 0.1189
Income: 0.1114
MntMeatProducts: 0.1081
MntWines: 0.1058
Random Forest feature importance plot saved as 'random_forest_feature_importance.png'
Random Forest estimated profit on test set: $116.00
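
The Random Forest figures above can be cross-checked by hand. The sketch below assumes 19 true positives and 12 false positives; these counts are inferred from the rounded precision, recall, and support in the report and are not printed by the notebook. It then applies the same cost and revenue values as `calculate_profit`.

```python
# Counts inferred from the rounded Random Forest report (an inference, not notebook output)
positives = 69          # support for class 1
test_size = 448         # total support
true_positives = 19     # 19 / 69 rounds to the reported recall of 0.28
false_positives = 12    # 19 / (19 + 12) rounds to the reported precision of 0.61

# Same values as in the modeling code
cost_per_contact = 3
revenue_per_response = 11

recall = true_positives / positives
precision = true_positives / (true_positives + false_positives)

# Profit = revenue from contacted responders - cost of every contact made
revenue = true_positives * revenue_per_response
cost = (true_positives + false_positives) * cost_per_contact
profit = revenue - cost

# The same formula applied to contacting all test customers
everyone_profit = positives * revenue_per_response - test_size * cost_per_contact

print(f"recall={recall:.2f}, precision={precision:.2f}")
print(f"revenue={revenue}, cost={cost}, profit={profit}")
print(f"profit if everyone is contacted: {everyone_profit}")
```

The inferred counts reproduce the reported $116.00, which suggests the profit figure and the classification report are consistent with each other.
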
Gradient Boosting Classification Report:
              precision    recall  f1-score   support

           0       0.
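
Because the notebook fits `SelectKBest` on all rows before splitting, one alternative worth sketching is to wrap selection, scaling, and the classifier in a scikit-learn `Pipeline`, so that every step is fit on training rows only. The example below runs on synthetic data from `make_classification`, which is an assumption made to keep it self-contained, so it does not reproduce the results above. The `stratify=y` and `max_iter=1000` settings are likewise additions that the original code does not use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the campaign data (roughly 15% positives)
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=8,
    weights=[0.85, 0.15], random_state=42,
)

# Split first so the test rows never influence selection or scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=15)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42)),
])

# fit() fits each step on the training rows only
pipeline.fit(X_train, y_train)

n_selected = int(pipeline.named_steps["select"].get_support().sum())
test_accuracy = pipeline.score(X_test, y_test)
print(f"Features kept: {n_selected}, test accuracy: {test_accuracy:.3f}")
```

A pipeline also makes cross-validation safer, since each fold refits the selector and scaler on its own training portion rather than reusing statistics computed from the full dataset.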