# Problem Statement
## Customer Lifetime Value (CLV) Prediction (Classification)
Build a machine learning model that predicts the Customer Lifetime Value (CLV) of a customer based on their demographic information and transaction history.

### Design
Based on the initial transactions/interactions of a customer, we want to identify if they will have a higher than average cltv or lower than average.<br>

We hardly have any data in this dataset, but we can try to make a simple model and see if it works.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

from sklearn.inspection import permutation_importance

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler

from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
pd.set_option("display.max_colwidth", None)

In [3]:
from os import walk

path = "../data/processed/"
filenames = next(walk(path), (None, None, []))[2]
print(filenames)

['campaigns.csv', 'customers.csv', 'customer_reviews_complete.csv', 'interactions.csv', 'support_tickets.csv', 'transactions.csv']


In [4]:
campaigns = pd.read_csv(path+"campaigns.csv")
customers = pd.read_csv(path+"customers.csv")
customer_reviews_complete = pd.read_csv(path+"customer_reviews_complete.csv")
interactions = pd.read_csv(path+"interactions.csv")
support_tickets = pd.read_csv(path+"support_tickets.csv")
transactions = pd.read_csv(path+"transactions.csv")

## Features

In [5]:
print(list(customers.columns))
print(list(transactions.columns))
print(list(campaigns.columns))
print(list(interactions.columns))
print(list(customer_reviews_complete.columns))
print(list(support_tickets.columns))

['Unnamed: 0', 'customer_id', 'full_name', 'age', 'gender', 'email', 'phone', 'street_address', 'city', 'state', 'zip_code', 'registration_date', 'preferred_channel']
['Unnamed: 0', 'transaction_id', 'customer_id', 'product_name', 'product_category', 'quantity', 'price', 'transaction_date', 'store_location', 'payment_method', 'discount_applied']
['Unnamed: 0', 'campaign_id', 'campaign_name', 'campaign_type', 'start_date', 'end_date', 'target_segment', 'budget', 'impressions', 'clicks', 'conversions', 'conversion_rate', 'roi']
['Unnamed: 0', 'interaction_id', 'customer_id', 'channel', 'interaction_type', 'interaction_date', 'duration', 'page_or_product', 'session_id']
['Unnamed: 0', 'review_id', 'customer_id', 'product_name', 'product_category', 'full_name', 'transaction_date', 'review_date', 'rating', 'review_title', 'review_text']
['Unnamed: 0', 'ticket_id', 'customer_id', 'issue_category', 'priority', 'submission_date', 'resolution_date', 'resolution_status', 'resolution_time_hours',

# Feature Engineering

## Feature Selection

We can include if a customer came from a campaign, but it would include too much noise, so it's better not to.

### 1. INPUT (X)

In [6]:
x_cust = customers[["customer_id", "age", "gender", "state", "preferred_channel"]].copy()
x_cust.head(1)

Unnamed: 0,customer_id,age,gender,state,preferred_channel
0,4C30E132-0704-4459-A509-9Eddde934977,40.0,Male,Texas,Other


In [7]:
first_dates = transactions.groupby(by=["customer_id"])["transaction_date"].min().reset_index()
first_purchase = transactions.merge(first_dates, on=["customer_id", "transaction_date"], how="inner")

x_trans = first_purchase[["customer_id", "product_category", "quantity", "price", "store_location", "payment_method"]].copy()
x_trans.head(1)

Unnamed: 0,customer_id,product_category,quantity,price,store_location,payment_method
0,727839B2-F084-4E94-94D8-Ae59Cc8E4B84,Smart Home Devices,1,140.07,"Houston, Tx",Credit Card


In [8]:
interactions["interaction_date"] = pd.to_datetime(interactions["interaction_date"]).dt.date
first_dates = interactions.groupby(by=["customer_id"])["interaction_date"].min().reset_index()
first_interactions = interactions.merge(first_dates, on=["customer_id", "interaction_date"], how="inner")

x_ints = first_interactions[["customer_id", "channel", "interaction_type", "duration"]].copy()
x_ints.head(1)

Unnamed: 0,customer_id,channel,interaction_type,duration
0,00012Aa8-E99C-4E30-B3F6-1F7E36Adc517,Other,Review,128.0


## Feature Creation

### 1. OUTPUT (Y)

In [9]:
y = transactions.copy()
y["total"] = y["quantity"] * y["price"]
y = y.groupby(by=["customer_id"]).agg(cltv=("total", "sum")).reset_index()
y["cltv"] = y["cltv"] > y["cltv"].mean()
y.head(1)

Unnamed: 0,customer_id,cltv
0,00012Aa8-E99C-4E30-B3F6-1F7E36Adc517,False


### 2. INPUT (X)

In [10]:
def onehot (df, col):
    onehot = pd.get_dummies(df[col], prefix=col)
    onehot = onehot.astype(int)

    df = pd.concat([onehot, df], axis=1)
    df = df.drop(columns=[col])
    return df



def multihot (df, col):
    mlb = MultiLabelBinarizer()
    vals = mlb.fit_transform(df[col])
    cols = mlb.classes_
    df_new = pd.DataFrame(vals, columns=cols)

    df = pd.concat([df_new, df], axis=1)
    df = df.drop(columns=[col])
    return df

#### 1. Customers

In [11]:
x_cust_new = onehot (x_cust, "gender")
x_cust_new = onehot (x_cust_new, "state")
x_cust_new = onehot (x_cust_new, "preferred_channel")
x_cust_new.head(1)

Unnamed: 0,preferred_channel_Both,preferred_channel_In-Store,preferred_channel_Online,preferred_channel_Other,state_Arizona,state_California,state_Florida,state_Georgia,state_Illinois,state_Massachusetts,...,state_Pennsylvania,state_Texas,state_Virginia,state_Washington,gender_Female,gender_Male,gender_Non-Binary,gender_Prefer Not To Say,customer_id,age
0,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,4C30E132-0704-4459-A509-9Eddde934977,40.0


#### 2. Transactions

In [12]:
x_trans["total"] = x_trans["quantity"] * x_trans["price"]
x_trans["product_category"] = x_trans["product_category"].replace("Other", "Product_Other")
x_trans_new = x_trans.groupby(by=["customer_id"]).agg({"product_category": list, "store_location": list, "payment_method": list, "total": "sum"}).reset_index()

x_trans_new = multihot (x_trans_new, "product_category")
x_trans_new = multihot (x_trans_new, "store_location")
x_trans_new = multihot (x_trans_new, "payment_method")

x_trans_new.head(1)

Unnamed: 0,Apple Pay,Cash,Credit Card,Debit Card,Gift Card,Google Pay,Paypal,"Atlanta, Ga","Boston, Ma","Chicago, Il",...,Kitchen Appliances,Laptops,Product_Other,Small Kitchen Appliances,Smart Home Devices,Smartphones,Tablets,Tvs,customer_id,total
0,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,00012Aa8-E99C-4E30-B3F6-1F7E36Adc517,474.93


#### 3. Interactions

In [13]:
x_ints["channel"] = x_ints["channel"].replace("Other", "Channel_Other")
x_ints["interaction_type"] = x_ints["interaction_type"].replace("Other", "Interaction_Other")
x_ints_new = x_ints.groupby(by=["customer_id"]).agg({"channel": list, "interaction_type": list, "duration": "mean"}).reset_index()

x_ints_new = multihot (x_ints_new, "channel")
x_ints_new = multihot (x_ints_new, "interaction_type")
x_ints_new.head(1)

Unnamed: 0,Add_To_Cart,App_Open,Checkout,Interaction_Other,Inventory_Check,Notification_Click,Page_View,Product_Lookup,Product_View,Purchase,...,Search,Session_Start,Store_Map_View,Wishlist_Add,Channel_Other,In-Store Kiosk,Mobile App,Web,customer_id,duration
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,00012Aa8-E99C-4E30-B3F6-1F7E36Adc517,128.0


## X and Y Compatability

In [14]:
x = x_cust_new.merge(x_trans_new, on=["customer_id"], how="inner")
x = x.merge(x_ints_new, on=["customer_id"], how="inner")
x = x.fillna(0)
x.head(1)

Unnamed: 0,preferred_channel_Both,preferred_channel_In-Store,preferred_channel_Online,preferred_channel_Other,state_Arizona,state_California,state_Florida,state_Georgia,state_Illinois,state_Massachusetts,...,Review,Search,Session_Start,Store_Map_View,Wishlist_Add,Channel_Other,In-Store Kiosk,Mobile App,Web,duration
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1.0


In [15]:
common_customers = set(x["customer_id"]) & set(y["customer_id"])

x = x[x["customer_id"].isin(common_customers)]
y = y[y["customer_id"].isin(common_customers)]

y = y.sort_values(by=["customer_id"]).reset_index(drop=True)
x = x.sort_values(by=["customer_id"]).reset_index(drop=True)

y = y.drop(columns=["customer_id"])
x = x.drop(columns=["customer_id"])

y = y["cltv"]
y

print(y.shape)
print(x.shape)

(4612,)
(4612, 80)


# Pre-EDA
Let's make sure if the features are actually going to give us a good model or not.

## 1. Statistical Tests

In [16]:
chi2_scores, chi2_pvals = chi2(x, y)
f_scores, f_pvals = f_classif(x, y)
mi_scores = mutual_info_classif(x, y)

In [17]:
scores = pd.DataFrame({
    'Feature': x.columns,
    'Chi2 Score': chi2_scores,
    'Chi2 p-value': chi2_pvals,
    'ANOVA F Score': f_scores,
    'ANOVA p-value': f_pvals,
    'Mutual Info': mi_scores
})
scores.head(5)

Unnamed: 0,Feature,Chi2 Score,Chi2 p-value,ANOVA F Score,ANOVA p-value,Mutual Info
0,preferred_channel_Both,32.805262,1.018687e-08,47.007493,8.005545e-12,0.003743
1,preferred_channel_In-Store,29.046908,7.064678e-08,36.035505,2.086166e-09,0.012551
2,preferred_channel_Online,0.609681,0.4349079,1.206077,0.2721678,0.00306
3,preferred_channel_Other,2.322913,0.1274811,2.375643,0.1233095,0.0
4,state_Arizona,0.014627,0.9037359,0.015092,0.9022309,0.000956


In [18]:
scores["Chi Good"] = scores["Chi2 p-value"] < 0.05
scores["ANOVA Good"] = scores["ANOVA p-value"] < 0.05

In [19]:
scores_filtered = scores[(scores["Chi Good"] == True) & (scores["ANOVA Good"] == True)]

selected_features = scores_filtered["Feature"]
print(len(selected_features))
print(selected_features)

12
0         preferred_channel_Both
1     preferred_channel_In-Store
16            state_Pennsylvania
24                           age
43               Audio Equipment
48                     Furniture
49               Gaming Consoles
50                    Home Decor
52                       Laptops
56                   Smartphones
58                           Tvs
59                         total
Name: Feature, dtype: object


# Data Splitting

## Train Test Split

In [20]:
selected_features = x.columns # comment to include only filtered features

X_train, X_test, y_train, y_test = train_test_split(
    x[selected_features], y, test_size=0.2, random_state=42, shuffle=True
)

### Imbalance Check

In [21]:
print(y_train.value_counts())
print(y_test.value_counts())

cltv
False    2300
True     1389
Name: count, dtype: int64
cltv
False    573
True     350
Name: count, dtype: int64


The split does not seem to be imbalanced.

## Normalization

Select whether to get a few filtered features from statistical eda or all of them.

In [22]:
numeric_cols = ["age", "duration", "total"]
numeric_cols = [c for c in numeric_cols if c in selected_features]
encoded_cols = [c for c in selected_features if not c in numeric_cols]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("encoded", "passthrough", encoded_cols)
    ]
)

X_train_norm = preprocessor.fit_transform(X_train)
X_test_norm = preprocessor.transform(X_test)

X_train_norm = pd.DataFrame(X_train_norm, columns=X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns=X_train.columns)

In [23]:
X_train_norm

Unnamed: 0,preferred_channel_Both,preferred_channel_In-Store,preferred_channel_Online,preferred_channel_Other,state_Arizona,state_California,state_Florida,state_Georgia,state_Illinois,state_Massachusetts,...,Review,Search,Session_Start,Store_Map_View,Wishlist_Add,Channel_Other,In-Store Kiosk,Mobile App,Web,duration
0,0.552802,0.880538,-0.387765,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,-1.398896,-0.556720,-0.187304,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,-0.562454,2.474806,-0.376277,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-0.097764,0.192103,-0.043273,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,-0.841268,0.892615,4.557245,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3684,-1.213020,-0.858665,-0.133731,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3685,-1.584772,-0.713731,-0.245918,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3686,-1.027144,-0.351397,0.087619,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3687,-1.584772,0.010936,-0.387494,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


# Model

In [24]:
def train_classification_model(model, X_train, y_train, cv=5, scoring='accuracy'):
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)

    print("Cross-Validation Scores:", cv_scores)
    print(f"Mean CV {scoring.capitalize()}: {cv_scores.mean():.4f}")
    print(f"Std Dev CV {scoring.capitalize()}: {cv_scores.std():.4f}")

    return model, cv_scores



def evaluate_classification_model(model, X_test, y_test, y_pred):
    if len(set(y_test)) != 2:
        raise ValueError("Target variable must be binary for classification.")

    accuracy = accuracy_score(y_test, y_pred)

    conf_matrix = confusion_matrix(y_test, y_pred)

    class_report = classification_report(y_test, y_pred)

    try:
        roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    except AttributeError:
        roc_auc = None

    print("Performance Evaluation:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Confusion Matrix:\n{conf_matrix}")
    print(f"Classification Report:\n{class_report}")
    if roc_auc is not None:
        print(f"ROC-AUC Score: {roc_auc:.4f}")
    else:
        print("ROC-AUC Score is not available for this model.")

    return accuracy, conf_matrix, class_report, roc_auc

## Training

### 1. Logistic Regression

In [25]:
model_lr = LogisticRegression(solver='liblinear', random_state=42)
model_lr, scores = train_classification_model(model_lr, X_train_norm, y_train)
model_lr.fit(X_train_norm, y_train)

y_pred_train = model_lr.predict(X_train_norm)
y_pred_test = model_lr.predict(X_test_norm)

Cross-Validation Scores: [0.6395664  0.63143631 0.62872629 0.64363144 0.60786974]
Mean CV Accuracy: 0.6302
Std Dev CV Accuracy: 0.0124


### 2. XGB Classifier

In [26]:
model_xgb = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
model_xgb, scores = train_classification_model(model_xgb, X_train_norm, y_train)
model_xgb.fit(X_train_norm, y_train)

y_pred_train = model_xgb.predict(X_train_norm)
y_pred_test = model_xgb.predict(X_test_norm)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Cross-Validation Scores: [0.60162602 0.63414634 0.60433604 0.62466125 0.59565807]
Mean CV Accuracy: 0.6121
Std Dev CV Accuracy: 0.0147


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


### 3. Random Forest Classifier

In [27]:
model_fst = RandomForestClassifier(random_state=42)
model_fst, scores = train_classification_model(model_fst, X_train_norm, y_train)

model_fst.fit(X_train_norm, y_train)

y_pred_train = model_fst.predict(X_train_norm)
y_pred_test = model_fst.predict(X_test_norm)

Cross-Validation Scores: [0.62601626 0.63279133 0.62872629 0.63143631 0.62686567]
Mean CV Accuracy: 0.6292
Std Dev CV Accuracy: 0.0026


## Scores

### 1. Logistic Regression

In [28]:
evaluate_classification_model (model_lr, X_train_norm, y_train, y_pred_train)

Performance Evaluation:
Accuracy: 1.0000
Confusion Matrix:
[[2300    0]
 [   0 1389]]
Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2300
        True       1.00      1.00      1.00      1389

    accuracy                           1.00      3689
   macro avg       1.00      1.00      1.00      3689
weighted avg       1.00      1.00      1.00      3689

ROC-AUC Score: 0.6578


(1.0,
 array([[2300,    0],
        [   0, 1389]]),
 '              precision    recall  f1-score   support\n\n       False       1.00      1.00      1.00      2300\n        True       1.00      1.00      1.00      1389\n\n    accuracy                           1.00      3689\n   macro avg       1.00      1.00      1.00      3689\nweighted avg       1.00      1.00      1.00      3689\n',
 np.float64(0.6577888377625443))

In [29]:
evaluate_classification_model (model_lr, X_test_norm, y_test, y_pred_test)

Performance Evaluation:
Accuracy: 0.6208
Confusion Matrix:
[[477  96]
 [254  96]]
Classification Report:
              precision    recall  f1-score   support

       False       0.65      0.83      0.73       573
        True       0.50      0.27      0.35       350

    accuracy                           0.62       923
   macro avg       0.58      0.55      0.54       923
weighted avg       0.59      0.62      0.59       923

ROC-AUC Score: 0.6118


(0.6208017334777898,
 array([[477,  96],
        [254,  96]]),
 '              precision    recall  f1-score   support\n\n       False       0.65      0.83      0.73       573\n        True       0.50      0.27      0.35       350\n\n    accuracy                           0.62       923\n   macro avg       0.58      0.55      0.54       923\nweighted avg       0.59      0.62      0.59       923\n',
 np.float64(0.6117825978558963))

### 2. XGB Classifier

In [30]:
evaluate_classification_model (model_xgb, X_train_norm, y_train, y_pred_train)

Performance Evaluation:
Accuracy: 1.0000
Confusion Matrix:
[[2300    0]
 [   0 1389]]
Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2300
        True       1.00      1.00      1.00      1389

    accuracy                           1.00      3689
   macro avg       1.00      1.00      1.00      3689
weighted avg       1.00      1.00      1.00      3689

ROC-AUC Score: 0.9950


(1.0,
 array([[2300,    0],
        [   0, 1389]]),
 '              precision    recall  f1-score   support\n\n       False       1.00      1.00      1.00      2300\n        True       1.00      1.00      1.00      1389\n\n    accuracy                           1.00      3689\n   macro avg       1.00      1.00      1.00      3689\nweighted avg       1.00      1.00      1.00      3689\n',
 np.float64(0.9950192506338623))

In [31]:
evaluate_classification_model (model_xgb, X_test_norm, y_test, y_pred_test)

Performance Evaluation:
Accuracy: 0.6208
Confusion Matrix:
[[477  96]
 [254  96]]
Classification Report:
              precision    recall  f1-score   support

       False       0.65      0.83      0.73       573
        True       0.50      0.27      0.35       350

    accuracy                           0.62       923
   macro avg       0.58      0.55      0.54       923
weighted avg       0.59      0.62      0.59       923

ROC-AUC Score: 0.6009


(0.6208017334777898,
 array([[477,  96],
        [254,  96]]),
 '              precision    recall  f1-score   support\n\n       False       0.65      0.83      0.73       573\n        True       0.50      0.27      0.35       350\n\n    accuracy                           0.62       923\n   macro avg       0.58      0.55      0.54       923\nweighted avg       0.59      0.62      0.59       923\n',
 np.float64(0.6008676140613313))

### 3. Random Forest Classifier

In [32]:
evaluate_classification_model (model_fst, X_train_norm, y_train, y_pred_train)

Performance Evaluation:
Accuracy: 1.0000
Confusion Matrix:
[[2300    0]
 [   0 1389]]
Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2300
        True       1.00      1.00      1.00      1389

    accuracy                           1.00      3689
   macro avg       1.00      1.00      1.00      3689
weighted avg       1.00      1.00      1.00      3689

ROC-AUC Score: 1.0000


(1.0,
 array([[2300,    0],
        [   0, 1389]]),
 '              precision    recall  f1-score   support\n\n       False       1.00      1.00      1.00      2300\n        True       1.00      1.00      1.00      1389\n\n    accuracy                           1.00      3689\n   macro avg       1.00      1.00      1.00      3689\nweighted avg       1.00      1.00      1.00      3689\n',
 np.float64(1.0))

In [33]:
evaluate_classification_model (model_fst, X_test_norm, y_test, y_pred_test)

Performance Evaluation:
Accuracy: 0.6208
Confusion Matrix:
[[477  96]
 [254  96]]
Classification Report:
              precision    recall  f1-score   support

       False       0.65      0.83      0.73       573
        True       0.50      0.27      0.35       350

    accuracy                           0.62       923
   macro avg       0.58      0.55      0.54       923
weighted avg       0.59      0.62      0.59       923

ROC-AUC Score: 0.6090


(0.6208017334777898,
 array([[477,  96],
        [254,  96]]),
 '              precision    recall  f1-score   support\n\n       False       0.65      0.83      0.73       573\n        True       0.50      0.27      0.35       350\n\n    accuracy                           0.62       923\n   macro avg       0.58      0.55      0.54       923\nweighted avg       0.59      0.62      0.59       923\n',
 np.float64(0.6090251807529294))

# Post-EDA

In [34]:
def get_importances (model, x, y):
    perm_importance = permutation_importance(model, x, y)

    importance_df = pd.DataFrame({
        'Feature': X_train.columns,  # If X_train is a DataFrame
        'Importance Mean': perm_importance.importances_mean,
        'Importance Std': perm_importance.importances_std
    })

    return importance_df.sort_values(by='Importance Mean', ascending=False)

## 1. Logistic Regression

In [35]:
get_importances (model_lr, X_train_norm, y_train)

Unnamed: 0,Feature,Importance Mean,Importance Std
2,preferred_channel_Online,0.035999,0.004835
79,duration,0.033776,0.007445
78,Web,0.029439,0.002494
77,Mobile App,0.016265,0.002079
5,state_California,0.014259,0.002692
...,...,...,...
39,"New York, Ny",-0.000813,0.000297
53,Product_Other,-0.000813,0.000514
60,Add_To_Cart,-0.000813,0.000767
23,gender_Prefer Not To Say,-0.001193,0.000632


In [36]:
get_importances (model_lr, X_test_norm, y_test)

Unnamed: 0,Feature,Importance Mean,Importance Std
2,preferred_channel_Online,0.032936,0.002334
63,Interaction_Other,0.027302,0.004668
77,Mobile App,0.025569,0.004678
3,preferred_channel_Other,0.021885,0.004718
69,Purchase,0.014085,0.005128
...,...,...,...
75,Channel_Other,-0.001517,0.002009
11,state_New Jersey,-0.001733,0.001105
66,Page_View,-0.001733,0.002009
17,state_Texas,-0.002384,0.001263


## 2. XGB Classifier

In [37]:
get_importances (model_xgb, X_train_norm, y_train)

Unnamed: 0,Feature,Importance Mean,Importance Std
2,preferred_channel_Online,0.191922,0.003834
0,preferred_channel_Both,0.163730,0.002357
1,preferred_channel_In-Store,0.144429,0.001973
4,state_Arizona,0.031607,0.002186
78,Web,0.024885,0.001299
...,...,...,...
26,Cash,0.000922,0.000368
73,Store_Map_View,0.000217,0.000108
18,state_Virginia,0.000217,0.000203
64,Inventory_Check,0.000054,0.000108


In [38]:
get_importances (model_xgb, X_test_norm, y_test)

Unnamed: 0,Feature,Importance Mean,Importance Std
0,preferred_channel_Both,0.012351,0.005675
4,state_Arizona,0.008017,0.005465
2,preferred_channel_Online,0.004334,0.006391
30,Google Pay,0.003684,0.004202
22,gender_Non-Binary,0.003250,0.001813
...,...,...,...
27,Credit Card,-0.004550,0.001592
3,preferred_channel_Other,-0.005417,0.002273
54,Small Kitchen Appliances,-0.006284,0.000433
78,Web,-0.012351,0.006036


## 3. Random Forest Classifier

In [39]:
get_importances (model_xgb, X_train_norm, y_train)

Unnamed: 0,Feature,Importance Mean,Importance Std
2,preferred_channel_Online,0.189320,0.007124
0,preferred_channel_Both,0.162700,0.002840
1,preferred_channel_In-Store,0.147682,0.006344
4,state_Arizona,0.031282,0.001065
78,Web,0.023313,0.002514
...,...,...,...
60,Add_To_Cart,0.000976,0.000503
18,state_Virginia,0.000596,0.000466
55,Smart Home Devices,0.000325,0.000266
73,Store_Map_View,0.000325,0.000108


In [40]:
get_importances (model_xgb, X_test_norm, y_test)

Unnamed: 0,Feature,Importance Mean,Importance Std
44,Bedding,0.003034,0.001864
2,preferred_channel_Online,0.003034,0.009254
34,"Chicago, Il",0.002600,0.002231
65,Notification_Click,0.002384,0.003521
30,Google Pay,0.002167,0.001370
...,...,...,...
61,App_Open,-0.006067,0.002432
20,gender_Female,-0.006284,0.002413
78,Web,-0.008017,0.006036
42,"Seattle, Wa",-0.008234,0.004525


# Conclusion
## Insight
1. The models turn out to be successful. They have a `83%` recall for `False` values. Meaning they can successfully identify which customers are not going to have high clv. This looks like a wasted effort. However we can now help the marketing team get early hints of which demographics tend to show low affinity to our site.
2. Using the filtered/selected features from the statistical analysis, we dont see any significant gain in accuracy.



## Future Analysis
1. We should first of all also include transactions/interactions for second time. That may give us better results. However if a customer purchased the second time, meaning he didnt completely churn, so he has a high likelyhood of having a high value clv anyway.
2. Work on some other problem.