`Naive Bayes Classifier for Pre-Purchase and Post-Purchase Calls`

`November 2025`

This project reports the results of a classification task: predicting whether a customer call is pre-purchase or post-purchase. The analysis uses a Naive Bayes classifier trained on transcripts of customer calls.

`Any questions, please reach out!`

Chiawei Wang, PhD\
Data & Product Analyst\
<chiawei.w@outlook.com>

`*` Note that the table of contents and other links may not work directly on GitHub.

[Table of Contents](#table-of-contents)
1. [Executive Summary](#executive-summary)
   - [Background](#background)
   - [Research Questions](#research-questions)
   - [Data Overview](#data-overview)
   - [Approach](#approach)
   - [Results](#results)
   - [Conclusion](#conclusion)
2. [Exploratory Data Analysis](#exploratory-data-analysis)

# Executive Summary

## Background

Customer service calls play a crucial role in shaping customer satisfaction and loyalty. Knowing whether a call occurs before or after a purchase can help businesses improve their customer service strategies. This report focuses on building a Naive Bayes classifier that predicts whether a customer call is pre-purchase or post-purchase from the words in its transcript.

## Research Questions

1. How accurate is the Naive Bayes classifier in predicting the type of customer call?
2. What are the key features that contribute to the classification of customer calls?

## Data Overview

The dataset contains the following columns:

| Column  | Type   | Description                                       |
| ------- | ------ | ------------------------------------------------- |
| `label` | object | Call type: `pre_purchase` or `post_purchase`      |
| `text`  | object | Transcription of the customer call                |

## Approach

1. Build a `Pipeline` named `classifier` from `CountVectorizer()`, `TfidfTransformer()`, and `MultinomialNB()`.
2. Split `df['text']` and `df['label']` into training and test sets with `train_test_split()` (80/20 split, `random_state = 42`).
3. Fit `classifier` on `X_train` and `y_train`.
4. Create `predicted` by calling `predict()` on `classifier` with `X_test`.
5. Evaluate the model by comparing `predicted` with `y_test`.

## Results

- The Naive Bayes classifier achieved an accuracy of 95.24% on the test set, indicating a strong ability to distinguish between pre-purchase and post-purchase calls.
- Key features contributing to the classification included specific keywords and phrases commonly associated with pre-purchase inquiries (e.g. wondering, looking) and post-purchase issues (e.g. received, wrong).
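
For scale, the reported accuracy can be traced back to counts. The check below is a small sketch that assumes the 102-row dataset and the 20% test split used in the notebook section of this report; scikit-learn rounds a fractional test size up.

```python
import math

n_rows = 102          # rows in customers.csv, per df.shape
test_fraction = 0.2   # test_size passed to train_test_split

# train_test_split rounds the test-set size up when given a fraction
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

# Recover the number of correct predictions behind the reported 95.24%
n_correct = round(0.9524 * n_test)
accuracy = 100 * n_correct / n_test

print(n_train, n_test, n_correct, f'{accuracy:.2f}%')
```

With a test set of this size, one misclassified call moves the accuracy by roughly 4.8 percentage points, so the figure is best read as a rough estimate.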

## Conclusion

The analysis demonstrates the potential of using machine learning techniques, such as Naive Bayes, to enhance customer service strategies. By accurately classifying customer calls, businesses can tailor their support efforts to better meet customer needs and improve overall satisfaction.

# Exploratory Data Analysis

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Read in the CSV as a DataFrame
df = pd.read_csv('customers.csv')

# Preview the data
print(df.shape)
df.head()

(102, 2)


|    | label         | text                                               |
|---:|:--------------|:---------------------------------------------------|
|  0 | pre_purchase  | how's it going Arthur I just placed an order w... |
|  1 | post_purchase | yeah hello I'm just wondering if I can speak t... |
|  2 | post_purchase | hey I receive my order but it's the wrong size... |
|  3 | pre_purchase  | hi David I just placed an order online and I w... |
|  4 | post_purchase | hey I bought something from your website the o... |


In [3]:
# Naive Bayes classifier for whether the customer call is pre-purchase or post-purchase
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase = True, stop_words = 'english')), # ngram_range=(1, 3)
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size = 0.2, random_state = 42) # optional: stratify = df['label']

# Fit the classifier pipeline on the training data
classifier.fit(X_train, y_train)

# Predict labels for the test set
predicted = classifier.predict(X_test)

# Evaluate accuracy
accuracy = 100 * np.mean(predicted == y_test)
print(f'The model is {accuracy:.2f}% accurate!')

# Example usage
print()
print('Example prediction:')
print()
call1 = ["Hi, I'm interested in knowing more about your premium plan. Can you tell me the price?"]
call2 = ['I want to cancel my subscription because the service is awful.']
prediction1 = classifier.predict(call1)
prediction2 = classifier.predict(call2)
print(f'Customer call: {call1[0]}')
print(f'Predicted category: {prediction1[0]}')
print()
print(f'Customer call: {call2[0]}')
print(f'Predicted category: {prediction2[0]}')

The model is 95.24% accurate!

Example prediction:

Customer call: Hi, I'm interested in knowing more about your premium plan. Can you tell me the price?
Predicted category: pre_purchase

Customer call: I want to cancel my subscription because the service is awful.
Predicted category: post_purchase
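
Because the test set is small, a single split gives a noisy accuracy estimate. A minimal sketch of stratified k-fold cross-validation on the same kind of pipeline follows. The `texts` and `labels` lists are hypothetical stand-ins for `df['text']` and `df['label']`, and the choice of 3 folds is arbitrary for the toy data rather than a setting from this analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; not taken from customers.csv
texts = [
    "hi I'm wondering if you have this jacket in blue",
    "I'm looking to place a new order online",
    "can you tell me the price of the premium plan",
    "I was wondering what sizes you carry",
    "I'm looking for a new phone case",
    "hello do you ship to Canada",
    "I received my order but it's the wrong size",
    "my package arrived yesterday and it is damaged",
    "the product I bought was delivered broken",
    "I received the wrong item in my package",
    "the shoes I bought yesterday do not fit",
    "my order was delivered to the wrong address",
]
labels = ['pre_purchase'] * 6 + ['post_purchase'] * 6

# Same three steps as the pipeline in this notebook
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Each fold keeps the same class balance as the full dataset
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')

print(f'Fold accuracies: {scores}')
print(f'Mean accuracy: {100 * scores.mean():.2f}%')
```

On the real data, passing `df['text']` and `df['label']` in place of the toy lists would give one accuracy per fold, and their spread indicates how much the single-split figure might vary.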


In [4]:
# Feature Importance in the Naive Bayes Classifier
feature_names = classifier.named_steps['vectorizer'].get_feature_names_out()

# Extract the classifier's log-probabilities
nb_model = classifier.named_steps['classifier']
log_probs = nb_model.feature_log_prob_ 

# Get the class labels, sorted alphabetically: ['post_purchase', 'pre_purchase']
classes = nb_model.classes_

# Get the log probabilities for the first and second classes
class_0_log_prob = log_probs[0]  # Log probability for the first class (classes[0])
class_1_log_prob = log_probs[1]  # Log probability for the second class (classes[1])

# Calculate feature importance as: (Class 1 Log Prob) - (Class 0 Log Prob)
# Positive scores favor classes[1]; negative scores favor classes[0]
feature_importance = class_1_log_prob - class_0_log_prob

# Compile results into a DataFrame
feature_df = pd.DataFrame({'Feature': feature_names, 'Importance Score': feature_importance})

# Sort and display results
top_class_1 = feature_df.sort_values(by='Importance Score', ascending=False).head(10)
top_class_0 = feature_df.sort_values(by='Importance Score', ascending=True).head(10)

# Display the top indicative words for each class
# Positive scores favor classes[1]; negative scores favor classes[0]
print(f'Top 10 words most indicative of {classes[1]} (importance score > 0):')
print(top_class_1.to_markdown(index = False))
print()
print(f'Top 10 words most indicative of {classes[0]} (importance score < 0):')
print(top_class_0.to_markdown(index = False))

Top 10 words most indicative of pre_purchase (importance score > 0):
| Feature   |   Importance Score |
|:----------|-------------------:|
| hi        |           1.58195  |
| wondering |           1.5162   |
| placed    |           1.37644  |
| looking   |           1.19542  |
| cancel    |           1.13404  |
| new       |           1.06804  |
| place     |           0.908883 |
| online    |           0.907806 |
| make      |           0.846021 |
| phone     |           0.834993 |

Top 10 words most indicative of post_purchase (importance score < 0):
| Feature   |   Importance Score |
|:----------|-------------------:|
| talk      |           -1.44801 |
| yesterday |           -1.38036 |
| received  |           -1.34802 |
| wrong     |           -1.29213 |
| arrived   |           -1.26495 |
| bought    |           -1.18181 |
| package   |           -1.1794  |
| delivered |           -1.12927 |
| product   |           -1.05325 |
| size      |           -1.05083 |
