# Predictive Ethnicity Classification for Customer Names

## Problem Statement
In e-commerce and customer analytics, understanding the demographics of a business's customer base is valuable. This project aims to predict the ethnicity of customers from their names using machine learning techniques. The challenge lies in developing models that can infer ethnicity accurately from names alone, given the multicultural diversity within the customer database.

## Objective
The primary objective of this project is to build and evaluate machine learning models, including Naive Bayes, Random Forest, and SVM, to predict the likely ethnicity of customers based on their names. By leveraging natural language processing and classification algorithms, the project seeks to provide insights into the ethnic distribution of customers within the dataset. This predictive capability can enhance demographic profiling and enable more targeted marketing strategies tailored to specific ethnic groups.

## Approach
1. Preprocess the textual data by tokenization, stemming, and removing stopwords.
2. Vectorize the processed names using techniques like CountVectorizer to convert text into numerical features.
3. Train and evaluate machine learning models, including Naive Bayes, Random Forest, and SVM.
4. Predict the ethnicity of new customer names using the trained models.
5. Analyze and compare the performance of different models.
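
Steps 2 to 4 above can be sketched as a single scikit-learn pipeline. This is a minimal illustration rather than the notebook's actual code: the tokens and the `GroupA`/`GroupB` labels are made-up placeholders, and `MultinomialNB` stands in for any of the three models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder "names" built from synthetic tokens (assumption: not real customer data)
toy_names = ["alpha beta", "alpha gamma", "delta epsilon", "delta zeta"]
toy_labels = ["GroupA", "GroupA", "GroupB", "GroupB"]

# CountVectorizer turns each name into token counts; the classifier learns from those counts
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_names, toy_labels)

# Predict labels for unseen names that share tokens with the training data
toy_predictions = toy_model.predict(["alpha zeta beta", "delta gamma epsilon"])
print(toy_predictions)
```

Wrapping the vectorizer and the model in one pipeline keeps the vocabulary fitted on the training names only, which is the same discipline the notebook follows with separate `fit_transform` and `transform` calls.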

## Expected Outcome
The project aims to deliver accurate ethnicity predictions for customer names, facilitating a deeper understanding of the customer demographics. This predictive capability can inform business decisions, marketing strategies, and personalized customer engagement efforts.

# ML Training and Prediction

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

import nltk
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
# Import the dataset (file with race data)
# Forward slashes work on Windows, macOS, and Linux
data = pd.read_excel("data/jan23-jun23_ethnic.xlsx")
# data = data.sample(frac=0.04, random_state=42)  # smaller dataset used during code testing

In [None]:
# Convert the ShippingName column to string (some values are read in as integers)
data['ShippingName'] = data['ShippingName'].astype(str)

# Remove rows where Ethnicity is False
data = data[data['Ethnicity'] != False]

In [None]:
# Text preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    words = word_tokenize(text)
    words = [stemmer.stem(word) for word in words if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(words)

data['processed_name'] = data['ShippingName'].apply(preprocess_text)

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['processed_name'], data['Ethnicity'], test_size=0.2, random_state=42)

In [None]:
# Calculate baseline accuracy (predicting the most frequent class)
baseline_accuracy = y_test.value_counts().max() / len(y_test)
print(f'Baseline Accuracy: {baseline_accuracy:.2%}')
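
The baseline above takes the majority class from the test labels. A slightly stricter variant, sketched here under assumptions, learns the majority class from the training labels only and then scores it on the test labels, using scikit-learn's `DummyClassifier`. The label arrays are made-up stand-ins for `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Placeholder labels (assumption: stand-ins for y_train / y_test)
toy_y_train = np.array(["GroupA"] * 6 + ["GroupB"] * 4)
toy_y_test = np.array(["GroupA", "GroupB", "GroupB", "GroupA", "GroupA"])

# DummyClassifier ignores the features, so a column of zeros is enough here
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)
toy_baseline_accuracy = accuracy_score(toy_y_test, baseline.predict(toy_X_test))
print(f"Baseline accuracy: {toy_baseline_accuracy:.2%}")
```

When the class proportions in the train and test splits are similar, both ways of computing the baseline should give nearly the same number.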

In [None]:
# Tag: customer_names_handling
# Feature extraction using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Notes about feature extraction from customer names
# TikTok MY customers have a shipping_firstname that is encapsulated within asterisks (TikTok's privacy policy), so the model ignores those names when tokenizing
# For the Jul 23 - Sep 23 dataset, they represent around 500 of the 40k+ observations, so the effect of classifying those names as "Others" is insignificant, as it affects less than 1% of the entire dataset
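
The effect of the `word.isalpha()` filter in `preprocess_text` on masked names can be illustrated with a small sketch. It is a simplification under assumptions: `keep_alpha_tokens` is a hypothetical helper that splits on whitespace instead of calling `word_tokenize` (whose handling of asterisks may differ), and the masked strings are made up because the exact masking format is not shown here.

```python
def keep_alpha_tokens(name):
    """Hypothetical helper: keep only purely alphabetic, whitespace-separated tokens."""
    return " ".join(token for token in name.split() if token.isalpha())

# Made-up examples; the real masking format is an assumption
partly_masked = keep_alpha_tokens("*abc* xyz")  # the token with asterisks is dropped
fully_masked = keep_alpha_tokens("*****")       # nothing alphabetic remains

print(repr(partly_masked))
print(repr(fully_masked))
```

A fully masked name ends up as an empty string, which is why the empty string is listed among the exceptional cases that are assigned to "Others" at prediction time.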

In [None]:
# Import necessary libraries for Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

# Make predictions on the test set
predictions = classifier.predict(X_test_vectorized)

# Evaluate the Naive Bayes model
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
print('\nClassification Report:')
print(classification_report(y_test, predictions))

In [None]:
# Naive Bayes Classifier
# Display test dataset with actual vs predicted values

# Create a DataFrame with the test dataset and a copy of relevant columns
results_table_nb = pd.DataFrame({'Processed Name': X_test, 'Actual Ethnicity': y_test, 'Predicted Ethnicity': predictions})

# Display the table
print('\nActual vs Predicted Ethnicity Table (Naive Bayes Classifier):')
print('Test Dataset (Random 20% of values)')
print(results_table_nb)

In [None]:
# Import necessary libraries for Random Forest
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_vectorized, y_train)

# Make predictions on the test set using Random Forest
rf_predictions = rf_classifier.predict(X_test_vectorized)

# Evaluate the Random Forest model
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')
print('\nClassification Report (Random Forest):')
print(classification_report(y_test, rf_predictions))


In [None]:
# Get numerical feature importances
importances = list(rf_classifier.feature_importances_)

# List of (feature, importance) tuples
feature_importances_rf = list(zip(vectorizer.get_feature_names_out(), importances))

# Sort by the unrounded importance, most important first
# (rounding to 2 decimals before sorting would tie most features at 0.0)
feature_importances_rf = sorted(feature_importances_rf, key=lambda x: x[1], reverse=True)

# Show top 10 feature importances
top_feature_importances_rf = feature_importances_rf[:10]
for feature, importance in top_feature_importances_rf:
    print(f'Variable: {feature:20} Importance: {importance:.4f}')


In [None]:
import matplotlib.pyplot as plt

# Plot feature importances
features, importances = zip(*top_feature_importances_rf)
plt.figure(figsize=(10, 6))
plt.bar(features, importances, color='skyblue')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Top 10 Feature Importances')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

In [None]:
# Random Forest
# Display test dataset with actual vs predicted values

# Create a DataFrame with the test dataset and a copy of relevant columns
results_table_rf = pd.DataFrame({'Processed Name': X_test, 'Actual Ethnicity': y_test, 'Predicted Ethnicity': rf_predictions})

# Display the table
print('\nActual vs Predicted Ethnicity Table (Random Forest):')
print('Test Dataset (Random 20% of values)')
print(results_table_rf)

In [None]:
# Import necessary libraries for SVM
from sklearn.svm import SVC

# Train a Support Vector Machine (SVM) classifier
svm_classifier = SVC(random_state=42)
svm_classifier.fit(X_train_vectorized, y_train)

# Make predictions on the test set using SVM
svm_predictions = svm_classifier.predict(X_test_vectorized)

# Evaluate the SVM model
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f'SVM Accuracy: {svm_accuracy}')
print('\nClassification Report (SVM):')
print(classification_report(y_test, svm_predictions))

In [None]:
# SVM
# Display test dataset with actual vs predicted values

# Create a DataFrame with the test dataset and a copy of relevant columns
results_table_svm = pd.DataFrame({'Processed Name': X_test, 'Actual Ethnicity': y_test, 'Predicted Ethnicity': svm_predictions})

# Display the table
print('\nActual vs Predicted Ethnicity Table (SVM):')
print('Test Dataset (Random 20% of values)')
print(results_table_svm)

In [None]:
# Create a DataFrame with model accuracies
accuracy_df = pd.DataFrame({
    'Model': ['Naive Bayes', 'Random Forest', 'SVM'],
    'Accuracy': [accuracy, rf_accuracy, svm_accuracy]
})

# Display the accuracy table
print('\nModel Accuracies:')
print(accuracy_df)

In [None]:
from prettytable import PrettyTable

# Detailed model performance metrics
# Function to create a pretty table from a classification report
def create_pretty_table(report, model_name):
    table = PrettyTable()
    
    # Check if the report is a dictionary
    if isinstance(report, dict):
        table.field_names = ["Class"] + list(report['weighted avg'].keys())
        for cls, values in report.items():
            # Check if values is a dictionary
            if isinstance(values, dict):
                table.add_row([cls] + list(values.values()))
    else:
        # Handle case where the report is a single float value
        table.field_names = ["Metric", "Value"]
        table.add_row(["Accuracy", report])

    table.title = f"Classification Report ({model_name})"
    return table

# Create classification reports
report_nb = classification_report(y_test, predictions, output_dict=True)
report_rf = classification_report(y_test, rf_predictions, output_dict=True)
report_svm = classification_report(y_test, svm_predictions, output_dict=True)

# Create PrettyTables for each classification report
table_nb = create_pretty_table(report_nb, "Naive Bayes")
table_rf = create_pretty_table(report_rf, "Random Forest")
table_svm = create_pretty_table(report_svm, "SVM")

# Print individual tables
print(table_nb)
print(table_rf)
print(table_svm)

In [None]:
# Plot confusion matrix for all three models
# model performance assessment on test data

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Function to create confusion matrix and plot it
def plot_confusion_matrix(ax, y_true, y_pred, model_name, color):
    cm = confusion_matrix(y_true, y_pred, labels=data['Ethnicity'].unique())
    sns.heatmap(cm, annot=True, fmt='d', cmap=color, xticklabels=data['Ethnicity'].unique(), yticklabels=data['Ethnicity'].unique(), ax=ax)
    ax.set_title(f'Confusion Matrix - {model_name}')
    ax.set_xlabel('Predicted Ethnicity')
    ax.set_ylabel('Actual Ethnicity')

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Confusion matrix and plot for Naive Bayes
plot_confusion_matrix(axes[0], y_test, predictions, 'Naive Bayes', 'Blues')

# Confusion matrix and plot for Random Forest
plot_confusion_matrix(axes[1], y_test, rf_predictions, 'Random Forest', 'Greens')

# Confusion matrix and plot for SVM
plot_confusion_matrix(axes[2], y_test, svm_predictions, 'SVM', 'Reds')

# Adjust layout and show the plots
plt.tight_layout()
plt.show()

In [None]:
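# Binary (one-vs-rest) ROC curves computed from hard class predictions
# Note: because predicted labels rather than scores are used, each curve has a single operating point
# Import necessary libraries for ROC curves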
from sklearn.metrics import roc_curve, auc

# Function to plot ROC curves for each category and each model
def plot_roc_curves(model, model_name, predictions, y_test):
    
    # Get the unique categories
    categories = data['Ethnicity'].unique()
    
    for category in categories:
        # Create binary ground truth for the specific category
        y_true_category = (y_test == category)
        # Create binary predictions for the specific category
        y_pred_category = (predictions == category)
        
        # Compute ROC curve for the specific category
        fpr, tpr, _ = roc_curve(y_true_category, y_pred_category)
        # Compute AUC for the specific category
        roc_auc = auc(fpr, tpr)
        
        # Plot ROC curve for the specific category
        plt.plot(fpr, tpr, label=f'ROC curve for {category} (AUC = {roc_auc:.2f})')

    plt.plot([0, 1], [0, 1], 'k--', label='Chance level (AUC = 0.5)')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curves - {model_name}')
    plt.legend()
    plt.show()

# Plot ROC curves for Naive Bayes
plot_roc_curves(classifier, 'Naive Bayes', predictions, y_test)

# Plot ROC curves for Random Forest
plot_roc_curves(rf_classifier, 'Random Forest', rf_predictions, y_test)

# Plot ROC curves for SVM
plot_roc_curves(svm_classifier, 'SVM', svm_predictions, y_test)

In [None]:
# One vs Rest Classifier ROC

from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Binarize the test labels for the ROC curves. Columns follow the sorted class
# order, which matches the column order of the scores returned by
# OneVsRestClassifier (its classes_ attribute is sorted)
class_labels = sorted(data['Ethnicity'].unique())
y_test_bin = label_binarize(y_test, classes=class_labels)

def plot_multiclass_roc(model, model_name):
    # One-vs-Rest strategy
    classifier = OneVsRestClassifier(model)
    classifier.fit(X_train_vectorized, y_train)

    # Access model's predictions (or decision function for SVC)
    if isinstance(classifier.estimators_[0], SVC):
        y_score = classifier.decision_function(X_test_vectorized)
    else:
        y_score = classifier.predict_proba(X_test_vectorized)

    # Initialize variables to store fpr, tpr, and roc_auc for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # Compute ROC curve and ROC area for each class
    for i in range(y_test_bin.shape[1]):
        fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

        # Plot ROC curve for each class
        plt.plot(fpr[i], tpr[i], label=f'ROC curve for {class_labels[i]} (AUC = {roc_auc[i]:.2f})')

    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'Multiclass ROC curves - {model_name}')
    plt.legend()
    plt.show()

# Plot ROC curves for each model
plot_multiclass_roc(MultinomialNB(), 'Naive Bayes')
plot_multiclass_roc(RandomForestClassifier(random_state=42), 'Random Forest')
plot_multiclass_roc(SVC(random_state=42, probability=True), 'SVM')
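
The per-class curves above can also be summarised as a single number. The sketch below uses `roc_auc_score` with `multi_class="ovr"` to compute a macro-averaged one-vs-rest AUC. The labels and probability rows are made-up placeholders; in the notebook the scores would come from a model's `predict_proba`, with columns in the same order as the `labels` argument.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder classes and true labels (assumption: not real data)
toy_classes = ["GroupA", "GroupB", "GroupC"]
toy_y_true = np.array(["GroupA", "GroupB", "GroupC", "GroupA", "GroupB", "GroupC"])

# One row of class probabilities per sample; rows sum to 1 and
# columns follow the order of toy_classes
toy_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],  # a GroupA sample that the scores rank as GroupB
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

# Macro average of the one-vs-rest AUC of each class
macro_auc = roc_auc_score(
    toy_y_true, toy_scores, multi_class="ovr", average="macro", labels=toy_classes
)
print(f"Macro one-vs-rest AUC: {macro_auc:.3f}")
```

Passing `labels` explicitly makes the expected column order visible, which helps avoid mismatches between score columns and class names.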

In [None]:
# Use the models to predict on new data
# Upload new data file
new_data = pd.read_excel('data/HJecommQ3blankethnicity.xlsx')
new_data['shipping_firstname'] = new_data['shipping_firstname'].astype(str)
new_data['processed_name'] = new_data['shipping_firstname'].apply(preprocess_text)

# Vectorize the new data
new_data_vectorized = vectorizer.transform(new_data['processed_name'])

# Predictions using Naive Bayes
new_predictions_nb = classifier.predict(new_data_vectorized)

# Predictions using Random Forest
new_predictions_rf = rf_classifier.predict(new_data_vectorized)

# Predictions using SVM
new_predictions_svm = svm_classifier.predict(new_data_vectorized)

In [None]:
from IPython.display import display, HTML

link_html = '<a href="#customer_names_handling">Refer to special name nomenclature cases</a>'
display(HTML(link_html))

# Handling special cases of customer names
# Define exceptional cases
exceptional_cases = ["nan", ""]

# Assign "Others" to predicted ethnicity for rows where processed_name matches exceptional cases
for case in exceptional_cases:
    new_predictions_nb[new_data['processed_name'] == case] = 'Others'
    new_predictions_rf[new_data['processed_name'] == case] = 'Others'
    new_predictions_svm[new_data['processed_name'] == case] = 'Others'
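
The loop above can also be written as one vectorised mask. This is a sketch under assumptions, using made-up placeholder names and labels: `isin` flags every exceptional row at once, and `np.where` builds a new array without modifying the original predictions in place.

```python
import numpy as np
import pandas as pd

# Placeholder data (assumption: stand-ins for new_data and a model's predictions)
toy_new_data = pd.DataFrame({"processed_name": ["alpha beta", "nan", "", "delta"]})
toy_predictions = np.array(["GroupA", "GroupB", "GroupA", "GroupB"], dtype=object)

exceptional_cases = ["nan", ""]

# Boolean mask: True where the processed name is one of the exceptional cases
is_exceptional = toy_new_data["processed_name"].isin(exceptional_cases).to_numpy()

# Replace the prediction with "Others" wherever the mask is True
toy_predictions = np.where(is_exceptional, "Others", toy_predictions)
print(toy_predictions)
```

The same mask could be reused for all three prediction arrays, so the exceptional cases are only evaluated once.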

In [None]:
# Add the predictions to the new dataset
new_data['Predicted_Ethnicity_NB'] = new_predictions_nb
new_data['Predicted_Ethnicity_RF'] = new_predictions_rf
new_data['Predicted_Ethnicity_SVM'] = new_predictions_svm

In [None]:
# Define date and time for file nomenclature
from datetime import datetime

# Get current date and time
current_datetime = datetime.now().strftime("%d%m%Y_%H%M%S%p")
print(current_datetime)

In [None]:
# Export file
new_data.to_excel(f'Data/Predictions/ecomm_demo_2023Q3_predicted_ethnicities_{current_datetime}.xlsx', index=False)
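
The export above fails if the output folder does not exist yet. A minimal sketch of creating it first with `pathlib` follows. The temporary base folder, the `toy_predictions_` file prefix, and the placeholder DataFrame are assumptions for illustration, and CSV is used only to keep the sketch free of Excel dependencies.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# Placeholder results (assumption: stand-in for new_data with predictions)
toy_results = pd.DataFrame(
    {"processed_name": ["alpha"], "Predicted_Ethnicity_NB": ["GroupA"]}
)

# A temporary folder stands in for the project directory (assumption)
base_dir = Path(tempfile.mkdtemp())
output_dir = base_dir / "Data" / "Predictions"

# Create the folder and any missing parents; no error if it already exists
output_dir.mkdir(parents=True, exist_ok=True)

# Timestamped file name, similar in spirit to the export above
timestamp = datetime.now().strftime("%d%m%Y_%H%M%S")
output_path = output_dir / f"toy_predictions_{timestamp}.csv"
toy_results.to_csv(output_path, index=False)
print(output_path.name)
```

In the notebook itself, the same `mkdir` call on the predictions folder before `to_excel` would serve the same purpose.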