## Introduction


Business Case Problem Statement

ABC Supermarket, a major player in the UK with multiple stores and a loyalty program with 250,000 participants, has launched a new line of organic products. To achieve fast product penetration, the company is planning to leverage its loyalty program by giving sample kits to its most probable buyers.

Problem:

ABC Supermarket needs to identify the most probable buyers of its new line of organic products from its pool of 250,000 loyalty program participants.

Challenge:

The company needs to develop a targeting strategy that is both efficient and effective. It needs to identify the most probable buyers with a high degree of accuracy, while also minimizing the cost of giving away sample kits.

Solution:

ABC Supermarket can use its loyalty program data to identify the most probable buyers of its new line of organic products. The company can segment its loyalty program participants based on their purchase history, demographics, and other relevant factors. It can then target the most promising segments with sample kits.

Expected Benefits:

By leveraging its loyalty program, ABC Supermarket can achieve fast product penetration for its new line of organic products. This will lead to increased sales and profitability for the company.

Metrics for Success:

The success of ABC Supermarket's loyalty program targeting strategy can be measured by the following metrics:

- Percentage of sample kit recipients who purchase the new line of organic products
- Increase in sales of the new line of organic products
- Return on investment of the loyalty program targeting campaign

In [51]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression

In [52]:
# numpy, train_test_split, StandardScaler, LogisticRegression and accuracy_score
# are already imported in the previous cell; only the new names are imported here.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB  # Gaussian Naive Bayes

In [53]:
import os 

In [54]:
os.getcwd()


'/Users/rajeshkumarroutray/Desktop'

In [55]:
os.chdir('/Users/rajeshkumarroutray/Desktop')


In [56]:
df = pd.read_excel("a1_Dataset_10Percent.xlsx")

The client shared data for roughly 10% of its loyalty program participants, along with their purchase decisions (the `TargetBuy` column in the data below). This labeled data is used to develop a model that predicts which customers are most likely to buy the new product line.

The client engaged us to formulate an analytics-enabled marketing strategy that predicts the most probable buyers among the remaining ~90% of participants, so that marketing effort can be targeted more effectively.

The objective is to optimize profitability and market penetration, given:

- Revenue from a successful buyer = 200 dollars: the company earns 200 dollars for every customer who buys the product.
- Cost of a promotional sample kit = 70 dollars: the company spends 70 dollars on each sample kit it sends.
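The two figures above imply a break-even purchase probability for sending a kit. The following is a minimal sketch of that arithmetic, assuming a recipient buys with the probability the model estimates and that purchases by non-recipients are ignored; the names `REVENUE_PER_BUYER`, `KIT_COST` and `expected_profit` are illustrative and do not appear in the original notebook.

```python
# Figures taken from the problem statement above.
REVENUE_PER_BUYER = 200.0  # dollars earned per successful buyer
KIT_COST = 70.0            # dollars spent per sample kit sent


def expected_profit(p_buy: float) -> float:
    """Expected profit (dollars) of sending one kit to a customer
    whose estimated purchase probability is p_buy."""
    return p_buy * REVENUE_PER_BUYER - KIT_COST


# A kit pays off on average only when p_buy * 200 > 70.
break_even = KIT_COST / REVENUE_PER_BUYER

print(f"Break-even probability: {break_even:.2f}")
print(f"Expected profit at p = 0.50: {expected_profit(0.50):.2f}")
```

Under these assumptions, customers whose predicted `prob_1` is below the break-even value would cost more in kits than they are expected to return.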

In [41]:
df.isna().sum()

ID                    0
DemAffl            1085
DemAge             1508
DemClusterGroup     674
DemGender          2512
DemReg              465
DemTVReg            465
LoyalClass            0
LoyalSpend            0
LoyalTime           281
TargetBuy             0
dtype: int64

In [57]:
df=df.drop(['ID'],axis=1)


In [58]:
df['DemAffl']=df['DemAffl'].fillna(df['DemAffl'].mode()[0])
df['DemAge']=df['DemAge'].fillna(df['DemAge'].mode()[0])
df['DemClusterGroup']=df['DemClusterGroup'].fillna(df['DemClusterGroup'].mode()[0])
df['DemGender']=df['DemGender'].fillna(df['DemGender'].mode()[0])
df['DemReg']=df['DemReg'].fillna(df['DemReg'].mode()[0])
df['DemTVReg']=df['DemTVReg'].fillna(df['DemTVReg'].mode()[0])
df['LoyalTime']=df['LoyalTime'].fillna(df['LoyalTime'].mean())
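The imputation above computes each fill value from the frame being filled. One possible variation, sketched here on toy data, is to compute the fill values once on the labeled data and reuse them on any later data set so both are treated identically. The helper `fit_fill_values` and the toy frames are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fit_fill_values(frame, mode_cols, mean_cols):
    """Return a {column: fill value} dict: mode for mode_cols, mean for mean_cols."""
    fill = {col: frame[col].mode()[0] for col in mode_cols}
    fill.update({col: frame[col].mean() for col in mean_cols})
    return fill


# Toy stand-ins for the labeled 10% data and the unlabeled 90% data.
train = pd.DataFrame({"DemGender": ["F", "F", "M", None],
                      "LoyalTime": [2.0, 4.0, np.nan, 6.0]})
fresh = pd.DataFrame({"DemGender": [None, "M"],
                      "LoyalTime": [np.nan, 1.0]})

# Learn the fill values on the training frame only ...
fill_values = fit_fill_values(train, mode_cols=["DemGender"], mean_cols=["LoyalTime"])

# ... and apply the same values to both frames.
train_filled = train.fillna(fill_values)
fresh_filled = fresh.fillna(fill_values)
print(fill_values)
```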

In [59]:
# Converting categorical columns to numeric codes

from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()

df['DemClusterGroup'] = number.fit_transform(df['DemClusterGroup'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

df['DemGender'] = number.fit_transform(df['DemGender'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

df['DemReg'] = number.fit_transform(df['DemReg'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

df['DemTVReg'] = number.fit_transform(df['DemTVReg'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

df['LoyalClass'] = number.fit_transform(df['LoyalClass'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'U': 6}
{'F': 0, 'M': 1, 'U': 2}
{'Midlands': 0, 'North': 1, 'Scottish': 2, 'South East': 3, 'South West': 4}
{'Border': 0, 'C Scotland': 1, 'East': 2, 'London': 3, 'Midlands': 4, 'N East': 5, 'N Scot': 6, 'N West': 7, 'S & S East': 8, 'S West': 9, 'Ulster': 10, 'Wales & West': 11, 'Yorkshire': 12}
{'Gold': 0, 'Platinum': 1, 'Silver': 2, 'Tin': 3}
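The five near-identical encoding steps above could be written as a loop that also keeps each fitted encoder for later reuse. This is a sketch on a toy frame; the `encoders` dictionary and the `demo` data are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns from the data set.
demo = pd.DataFrame({
    "DemGender": ["F", "M", "U", "F"],
    "LoyalClass": ["Gold", "Tin", "Silver", "Platinum"],
})

categorical_cols = ["DemGender", "LoyalClass"]
encoders = {}  # column name -> fitted LabelEncoder

for col in categorical_cols:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col].astype(str))
    encoders[col] = enc
    # Show the label -> integer mapping, as the original cell does.
    print(col, {label: code for code, label in enumerate(enc.classes_)})
```

Keeping one encoder per column avoids overwriting the fitted state each time `number` is reused.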


In [60]:
df_median = df.copy()

In [None]:
# Model building on df: predict TargetBuy from the remaining columns

In [62]:
# After dropping ID, columns 0-8 are the features and column 9 is TargetBuy
y = df.iloc[:, 9].values
X = df.iloc[:, 0:9].values

In [63]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y)

In [64]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Classifiers
clf1 = LogisticRegression(multi_class='multinomial', solver='newton-cg', random_state=1)
clf2 = KNeighborsClassifier(algorithm='ball_tree', leaf_size=50)
clf3 = DecisionTreeClassifier(random_state=1)
clf4 = SVC(random_state=1)
clf5 = GaussianNB()
clf6 = RandomForestClassifier(random_state=1)
clf7 = XGBClassifier(random_state=1)

# Building the pipelines
pipe1 = Pipeline([('std', StandardScaler()), ('clf1', clf1)])
pipe2 = Pipeline([('std', StandardScaler()), ('clf2', clf2)])
pipe3 = Pipeline([('std', StandardScaler()), ('clf3', clf3)])
pipe4 = Pipeline([('std', StandardScaler()), ('clf4', clf4)])
pipe5 = Pipeline([('std', StandardScaler()), ('clf5', clf5)])
pipe6 = Pipeline([('std', StandardScaler()), ('clf6', clf6)])
pipe7 = Pipeline([('std', StandardScaler()), ('clf7', clf7)])

# Setting up the parameter grids
param_grid1 = [{'clf1__penalty': ['l2'], 'clf1__C': np.power(10., np.arange(-4, 4))}]
param_grid2 = [{'clf2__n_neighbors': list(range(1, 10)), 'clf2__p': [1, 2]}]
param_grid3 = [{'clf3__max_depth': list(range(1, 10)) + [None], 'clf3__criterion': ['gini', 'entropy']}]

param_grid4 = [{'clf4__C': np.power(10., np.arange(-4, 4)),
                'clf4__kernel': ['rbf', 'linear'],
                'clf4__gamma': np.power(10., np.arange(-5, 0))}]
param_grid6 = [{'clf6__n_estimators': [50, 100, 200], 'clf6__max_depth': [10, 20, None]}]
param_grid7 = [{'clf7__n_estimators': [50, 100, 200], 'clf7__learning_rate': [0.01, 0.1, 0.2]}]



In [65]:
gridcvs = {}

# StratifiedKFold for inner cross-validation
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)

# Loop through classifiers, pipelines, and parameter grids
for pgrid, est, name in zip((param_grid1, param_grid2, param_grid3, param_grid4, param_grid6, param_grid7),
                            (pipe1, pipe2, pipe3, pipe4, pipe6, pipe7),
                            ('Logreg', 'KNN', 'DTree', 'SVM', 'RandomForest', 'XGBoost')):
    gcv = GridSearchCV(estimator=est, param_grid=pgrid, scoring='accuracy', n_jobs=-1, cv=inner_cv, verbose=0, refit=True)
    gridcvs[name] = gcv

In [20]:
for name, gs_est in sorted(gridcvs.items()):

    print(50 * '-', '\n')
    print('Algorithm:', name)
    print('    Inner loop:')
    
    outer_scores = []
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    
    
    for train_idx, valid_idx in outer_cv.split(X_train, y_train):
        
        gridcvs[name].fit(X_train[train_idx], y_train[train_idx]) # run inner loop hyperparam tuning
        print('\n        Best ACC (avg. of inner test folds) %.2f%%' % (gridcvs[name].best_score_ * 100))
        print('        Best parameters:', gridcvs[name].best_params_)
        
        # perf on test fold (valid_idx)
        outer_scores.append(gridcvs[name].best_estimator_.score(X_train[valid_idx], y_train[valid_idx]))
        print('        ACC (on outer test fold) %.2f%%' % (outer_scores[-1]*100))
    
    print('\n    Outer Loop:')
    print('        ACC %.2f%% +/- %.2f' % 
              (np.mean(outer_scores) * 100, np.std(outer_scores) * 100))

-------------------------------------------------- 

Algorithm: DTree
    Inner loop:

        Best ACC (avg. of inner test folds) 81.00%
        Best parameters: {'clf3__criterion': 'gini', 'clf3__max_depth': 5}
        ACC (on outer test fold) 81.62%

        Best ACC (avg. of inner test folds) 81.28%
        Best parameters: {'clf3__criterion': 'gini', 'clf3__max_depth': 5}
        ACC (on outer test fold) 80.46%

        Best ACC (avg. of inner test folds) 81.05%
        Best parameters: {'clf3__criterion': 'gini', 'clf3__max_depth': 4}
        ACC (on outer test fold) 81.77%

        Best ACC (avg. of inner test folds) 81.43%
        Best parameters: {'clf3__criterion': 'entropy', 'clf3__max_depth': 5}
        ACC (on outer test fold) 81.58%

        Best ACC (avg. of inner test folds) 81.23%
        Best parameters: {'clf3__criterion': 'entropy', 'clf3__max_depth': 6}
        ACC (on outer test fold) 80.68%

    Outer Loop:
        ACC 81.22% +/- 0.54
-----------------------
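The nested loop above can also be expressed more compactly by passing the `GridSearchCV` object itself to `cross_val_score`, which runs the inner search inside each outer fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the loyalty data and a reduced parameter grid, so its scores are not comparable to the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=1)

pipe = Pipeline([("std", StandardScaler()),
                 ("clf", DecisionTreeClassifier(random_state=1))])
grid = {"clf__max_depth": [2, 4, 6], "clf__criterion": ["gini", "entropy"]}

inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate.
gcv = GridSearchCV(pipe, grid, scoring="accuracy", cv=inner_cv, refit=True)
outer_scores = cross_val_score(gcv, X_demo, y_demo, scoring="accuracy", cv=outer_cv)

print("ACC %.2f%% +/- %.2f" % (outer_scores.mean() * 100, outer_scores.std() * 100))
```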

In [66]:
gcv_model_select = GridSearchCV(estimator=pipe6, param_grid=param_grid6,
                                 scoring='accuracy', n_jobs=-1, cv=inner_cv, verbose=1, refit=True)
gcv_model_select.fit(X_train, y_train)

# Print results
print('Best CV accuracy: %.2f%%' % (gcv_model_select.best_score_ * 100))
print('Best parameters:', gcv_model_select.best_params_)

Fitting 2 folds for each of 9 candidates, totalling 18 fits
Best CV accuracy: 80.89%
Best parameters: {'clf6__max_depth': 10, 'clf6__n_estimators': 100}


In [22]:
# Class probabilities for the held-out test set, from the refit GridSearchCV model
predictions = gcv_model_select.predict_proba(X_test)
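The notebook imports `accuracy_score` and `confusion_matrix` but the shown cells do not apply them to the held-out test set. A minimal sketch of such a check follows, assuming synthetic data in place of the loyalty data and a random forest with the hyperparameters reported by the grid search; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loyalty data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=9,
                                     weights=[0.75, 0.25], random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17, stratify=y_demo)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=1)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)  # column 0 = P(class 0), column 1 = P(class 1)
y_pred = model.predict(X_te)

print("Test accuracy: %.2f%%" % (accuracy_score(y_te, y_pred) * 100))
print(confusion_matrix(y_te, y_pred))  # rows = actual class, columns = predicted class
```

With an imbalanced target, the confusion matrix is usually more informative than accuracy alone, because a model can score well simply by predicting the majority class.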




In [76]:
import joblib

# Persist the fitted GridSearchCV object (including the refit best estimator)
joblib.dump(gcv_model_select, './c2_GridSearchCV_LoyalCustomers')


['./c2_GridSearchCV_LoyalCustomers']

In [25]:
df_prediction_prob = pd.DataFrame(predictions, columns=['prob_0', 'prob_1'])

# Create DataFrames for test dataset and features
df_test_dataset = pd.DataFrame(y_test, columns=['Actual Outcome'])
df_x_test = pd.DataFrame(X_test)

# Concatenate all DataFrames
dfx = pd.concat([df_x_test, df_test_dataset, df_prediction_prob], axis=1)
excel_file_path = "/Users/rajeshkumarroutray/Downloads/c1_ModelOutput_10Percent.xlsx"

# Save the DataFrame to an Excel file
dfx.to_excel(excel_file_path, index=False)

In [84]:
os.chdir('/Users/rajeshkumarroutray/Downloads')


In [85]:
predict_data = pd.read_excel(" 90_percent_file.xlsx")

In [86]:
predict_data=predict_data.drop(['ID'],axis=1)


In [87]:

predict_data['DemAffl']=predict_data['DemAffl'].fillna(predict_data['DemAffl'].mode()[0])
predict_data['DemAge']=predict_data['DemAge'].fillna(predict_data['DemAge'].mode()[0])
predict_data['DemClusterGroup']=predict_data['DemClusterGroup'].fillna(predict_data['DemClusterGroup'].mode()[0])
predict_data['DemGender']=predict_data['DemGender'].fillna(predict_data['DemGender'].mode()[0])
predict_data['DemReg']=predict_data['DemReg'].fillna(predict_data['DemReg'].mode()[0])
predict_data['DemTVReg']=predict_data['DemTVReg'].fillna(predict_data['DemTVReg'].mode()[0])
predict_data['LoyalTime']=predict_data['LoyalTime'].fillna(predict_data['LoyalTime'].mean())

In [88]:
# Converting categorical columns to numeric codes

from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()

predict_data['DemClusterGroup'] = number.fit_transform(predict_data['DemClusterGroup'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

predict_data['DemGender'] = number.fit_transform(predict_data['DemGender'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

predict_data['DemReg'] = number.fit_transform(predict_data['DemReg'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

predict_data['DemTVReg'] = number.fit_transform(predict_data['DemTVReg'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

predict_data['LoyalClass'] = number.fit_transform(predict_data['LoyalClass'].astype('str'))
integer_mapping = {l: i for i, l in enumerate(number.classes_)}
print(integer_mapping)

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'U': 6}
{'F': 0, 'M': 1, 'U': 2}
{'Midlands': 0, 'North': 1, 'Scottish': 2, 'South East': 3, 'South West': 4}
{'Border': 0, 'C Scotland': 1, 'East': 2, 'London': 3, 'Midlands': 4, 'N East': 5, 'N Scot': 6, 'N West': 7, 'S & S East': 8, 'S West': 9, 'Ulster': 10, 'Wales & West': 11, 'Yorkshire': 12}
{'Gold': 0, 'Platinum': 1, 'Silver': 2, 'Tin': 3}
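The mappings printed above match the training mappings because the same categories happen to occur in both files. Refitting a `LabelEncoder` on new data can silently shift the codes if a category is missing, so one option is to reuse the encoder fitted on the training data. The toy values below are assumptions chosen to show the difference.

```python
from sklearn.preprocessing import LabelEncoder

train_values = ["Gold", "Platinum", "Silver", "Tin", "Silver"]
fresh_values = ["Silver", "Tin", "Tin"]  # no Gold or Platinum in this batch

# Reuse the encoder fitted on the training values: codes stay consistent.
enc = LabelEncoder().fit(train_values)
reused = enc.transform(fresh_values)

# Refit on the new values only: the same labels receive different codes.
refit = LabelEncoder().fit_transform(fresh_values)

print("reused codes:", reused.tolist())
print("refit codes: ", refit.tolist())
```

Note that `transform` raises an error for labels the encoder has not seen, which makes unexpected categories visible instead of mis-encoding them.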


In [91]:
# Same nine feature columns, in the same order, as used for training
X_fresh = predict_data.iloc[:, 0:9].values

In [92]:
import joblib

# Load the saved GridSearchCV object
loaded_gcv_model_select = joblib.load('./c2_GridSearchCV_LoyalCustomers')

In [93]:
predictions = loaded_gcv_model_select.predict_proba(X_fresh)
predictions

array([[0.97223149, 0.02776851],
       [0.98240692, 0.01759308],
       [0.89866009, 0.10133991],
       ...,
       [0.9094924 , 0.0905076 ],
       [0.81691352, 0.18308648],
       [0.65520221, 0.34479779]])

In [94]:
df_90_prob = pd.DataFrame(predictions, columns = ['prob_0', 'prob_1'])
dfx=pd.concat([predict_data,df_90_prob], axis=1)
excel_file_path = "/Users/rajeshkumarroutray/Downloads/c1_ModelOutput_90Percentv2.xlsx"

# Save the DataFrame to an Excel file
dfx.to_excel(excel_file_path, index=False)

In [101]:
dfx = dfx.sort_values(by = 'prob_1', ascending = False)
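Once customers are sorted by `prob_1`, a common next step is to bucket them into deciles and compare expected profit per bucket. The sketch below is one way to do that, assuming random scores in place of the model output and the revenue and kit-cost figures from the problem statement; the `scored` frame and `summary` table are illustrative names.

```python
import numpy as np
import pandas as pd

# Random stand-in for the model's prob_1 column (assumption).
rng = np.random.default_rng(0)
scored = pd.DataFrame({"prob_1": rng.random(1000)})
scored = scored.sort_values(by="prob_1", ascending=False).reset_index(drop=True)

# Ranking with method="first" breaks ties, so qcut always gets unique bin edges.
ranks = scored["prob_1"].rank(method="first", ascending=False)
scored["decile"] = pd.qcut(ranks, 10, labels=False) + 1  # decile 1 = highest probabilities

REVENUE_PER_BUYER, KIT_COST = 200.0, 70.0
summary = scored.groupby("decile")["prob_1"].agg(customers="count", mean_prob="mean")
# Expected profit if every customer in the decile receives a kit.
summary["expected_profit"] = summary["customers"] * (
    summary["mean_prob"] * REVENUE_PER_BUYER - KIT_COST)
print(summary)
```

Deciles with a negative expected profit are candidates to exclude from the sample-kit campaign.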

In [106]:
# Drop the helper 'decile' column if an earlier step (not shown here) added it
dfx.drop(columns='decile', errors='ignore', inplace=True)