In [None]:
from IPython import display
display.Image('/Users/vamsi/Desktop/KPMG/image.png', width=1200)

## KPMG Internship Module_3

**Project Name: "Customers Recommendation Project"**  
**Client: "Sprocket Central Pty Ltd Company"**

**Project Brief:**  
Sprocket Central Pty Ltd, a medium-sized organization specializing in bikes and cycling accessories, has provided KPMG with three datasets: customer demographic, customer addresses, and transaction data for the past three months. The client needs help analyzing this data to optimize their marketing strategy for the new customer list.

**Module#01 Objective: Data Quality Assessment Report**  
In module #1, we cleaned and integrated the data.

**Module#02 Objective: Data Exploration**  
In module #02, we conducted a comprehensive exploratory data analysis, an RFM analysis, and customer segmentation.

**Module#03 Objective: Model building, training and testing**  
In module #03, the client provided an additional dataset called "New Customers List" comprising 1000 records of customers who haven't purchased any products. They need help identifying which customers to target with marketing campaigns based on this new dataset. We'll use a machine learning classification model trained on the old customer dataset, which includes RFM segmentations, to predict the most probable segment for each new customer. This approach will guide us in making informed decisions on which marketing campaigns to focus on.

# Table of Contents

- [1.0. Old Customers RFM Dataset Features Engineering](#old-customers-rfm-dataset-features-engineering)
    - [1.1. One-Hot Encoding](#one-hot-encoding)
    - [1.2. Label Encoding](#label-encoding)
- [2.0. New_Customers Dataset Features Engineering](#new_customers-dataset-features-engineering)
    - [2.1. One-Hot Encoding](#one-hot-encoding-1)
    - [2.2. Label Encoding](#label-encoding-1)
    - [2.3. Check](#check)
- [3.0. Model building - RFM_loyalty_level](#model-building-rfm_loyalty_level)
    - [3.1. Training the Model with Old Data](#training-the-model-with-old-data)
    - [3.2. Testing the Model with New Data](#testing-the-model-with-new-data)
- [4.0. Model building - RFMscore](#model-building-rfmscore)
    - [4.1. Training the Model with Old Data](#training-the-model-with-old-data-1)
    - [4.2. Testing the Model with New Data](#testing-the-model-with-new-data-1)
- [5.0. New Customers - Further Segmentation](#new-customers-further-segmentation)
    - [5.1. New Customers - Predicted Results](#new-customers-predicted-results)


In [None]:
# import libraries
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import datetime as dt
import calendar
import seaborn as sns 
sns.set_style("whitegrid")

## 1.0. Old Customers RFM Dataset Features Engineering:<a id='old-customers-rfm-dataset-features-engineering'></a>

In [None]:
# read in the old customers RFM dataset from file
old_customers_rfm = pd.read_csv('old_customers_rfm.csv')
old_customers_rfm.columns

In [None]:
# drop unnamed column
old_customers_rfm.drop('Unnamed: 0',axis=1,inplace=True)
# check first few rows
old_customers_rfm.head()

A new dataframe will be created for training a classification model to predict RFM_loyalty_level for a fresh dataset of 1000 new customers with similar features.  

The dataset comprises **34 columns and 19773 rows**. The columns are grouped into three primary categories, namely **transaction information, customer information, and customer demographics**.

Our main goal is to **identify any trends present in the data and determine the customer segment with the highest customer value**. To support our investigation, we will focus on specific features such as **list price, standard cost, past 3 years bike-related purchases, age, transaction month, day of the week, wealth segment, state, and gender**.

In [None]:
# create a new df called "old_customers" with selected columns from the "old_customers_rfm" dataframe
# .copy() makes it an independent frame, so adding columns later does not trigger SettingWithCopyWarning
old_customers = old_customers_rfm[['gender','past_3_years_bike_related_purchases','job_industry_category','wealth_segment','owns_car','tenure','age','property_valuation','RFM_loyalty_level','RFMscore']].copy()
old_customers.info()

-------------------------------------------------------------------

In [None]:
# get the number of rows and columns
old_customers.shape

### 1.1. One-Hot Encoding:<a id='one-hot-encoding'></a>

For the nominal columns (gender, job_industry_category, owns_car), one-hot encoding will be used to transform them into binary indicator columns that an ML model can consume.
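
As a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does, the toy frame below (its values are illustrative assumptions, not client data) shows that the first level of each column, in sorted order, is dropped, so a two-level column becomes a single indicator column.

```python
import pandas as pd

# Toy frame; the values are illustrative, not taken from the client data.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "owns_car": ["Yes", "No", "Yes"],
})

# drop_first=True drops the first (sorted) level of each encoded column,
# so "Female" and "No" become the implicit baseline levels.
encoded = pd.get_dummies(toy, columns=["gender", "owns_car"], drop_first=True)

print(encoded.columns.tolist())  # ['gender_Male', 'owns_car_Yes']
```

Dropping one level avoids a redundant column, since the dropped level is implied when all remaining indicators are zero.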

In [None]:
# change gender data columns using one hot coding into binary
gender=old_customers[['gender']]
gender=pd.get_dummies(gender,drop_first=True)
gender.head()

In [None]:
# change job_industry_category data columns using one hot coding into binary
job_industry_category=old_customers[['job_industry_category']]
job_industry_category=pd.get_dummies(job_industry_category,drop_first=True)
job_industry_category.head()

In [None]:
# change owns_car data columns using one hot coding into binary
owns_car=old_customers[['owns_car']]
owns_car=pd.get_dummies(owns_car,drop_first=True)
owns_car.head()

### 1.2. Label Encoding:<a id='label-encoding'></a>

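The wealth_segment column will be converted into integer codes using a label encoder, as it is treated as an ordinal category column. The resulting column keeps the name wealth_segment_binary, although it can hold more than two distinct values.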
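
One caveat is that `LabelEncoder` assigns codes in sorted (alphabetical) order, which need not match a wealth ranking. The sketch below illustrates this with assumed category names and contrasts it with an explicit mapping; the names in `segments` and the ranking in `wealth_order` are hypothetical and should be checked against the actual column values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed category names, for illustration only.
segments = pd.Series(
    ["Mass Customer", "High Net Worth", "Affluent Customer", "Mass Customer"]
)

# LabelEncoder numbers the classes in sorted (alphabetical) order.
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(segments)
print(list(encoder.classes_))

# Hypothetical explicit ranking, if a wealth order is wanted instead.
wealth_order = {"Mass Customer": 0, "Affluent Customer": 1, "High Net Worth": 2}
ordinal_codes = segments.map(wealth_order)
```

Tree-based models only split on thresholds, so either coding can work, but an explicit mapping also guarantees that the old and new datasets are coded identically.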

In [None]:
# encode the wealth_segment column as integer codes using LabelEncoder
from sklearn.preprocessing import LabelEncoder
old_customers['wealth_segment_binary']=LabelEncoder().fit_transform(old_customers['wealth_segment'])

A new dataframe will be created, consisting of the binary-transformed columns and numerical columns, to be used in the ML model.

In [None]:
old_customers1=old_customers[['past_3_years_bike_related_purchases','tenure','age','property_valuation','wealth_segment_binary']]

In [None]:
# concatenate transformed categorical columns with the old_customers dataframe
old_customers1=pd.concat([gender,job_industry_category,owns_car,old_customers1],axis=1)

In [None]:
old_customers1.shape

In [None]:
# final result
old_customers1.head()

## 2.0. New_Customers Dataset Features Engineering:<a id='new_customers-dataset-features-engineering'></a>

In [None]:
# read in new_customers sheet from file
new_customers = pd.read_csv('new_customers.csv')

In [None]:
# drop unnamed column
new_customers.drop('Unnamed: 0',axis=1,inplace=True)
# check first few rows
new_customers.head()

### 2.1. One-Hot Encoding:<a id='one-hot-encoding-1'></a>

For the nominal columns (gender, job_industry_category, owns_car), one-hot encoding will be used to transform them into binary indicator columns that an ML model can consume.

In [None]:
# change categorical data columns into binary using one hot coding
gender_new=new_customers[['gender']]
gender_new=pd.get_dummies(gender_new,drop_first=True)
gender_new.head()

In [None]:
# change job_industry_category_new categorical column into binary using one hot coding
job_industry_category_new=new_customers[['job_industry_category']]
job_industry_category_new=pd.get_dummies(job_industry_category_new,drop_first=True)
job_industry_category_new.head()

In [None]:
# change owns_car_new categorical column into binary using one hot coding
owns_car_new=new_customers[['owns_car']]
owns_car_new=pd.get_dummies(owns_car_new,drop_first=True)
owns_car_new.head()

### 2.2. Label Encoding:<a id='label-encoding-1'></a>

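The wealth_segment column will be converted into integer codes using a label encoder, as it is treated as an ordinal category column. The resulting column keeps the name wealth_segment_binary, although it can hold more than two distinct values.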

In [None]:
# encode the wealth_segment column as integer codes using LabelEncoder
new_customers['wealth_segment_binary'] = LabelEncoder().fit_transform(new_customers['wealth_segment'])

A new dataframe will be created, consisting of the binary-transformed columns and numerical columns, to be used in the ML model.

In [None]:
#create a new dataframe with numerical values only
new_customers1=new_customers[['past_3_years_bike_related_purchases','tenure','age','property_valuation','wealth_segment_binary']]

In [None]:
# Concatenate transformed categorical columns with the new_customer numerical dataframe
new_customers1=pd.concat([gender_new,job_industry_category_new,owns_car_new,new_customers1],axis=1)

In [None]:
new_customers1.head()

### 2.3. Check:<a id='check'></a>
We now compare the shapes and column types of the transformed old and new datasets, since the model expects the same features in both.
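
Because the two datasets are one-hot encoded separately, a category present in only one of them would yield mismatched columns. A minimal sketch of one way to guard against this is to reindex the new frame onto the training columns; the frames `old_encoded` and `new_encoded` and their column names are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical one-hot encoded frames; column names are illustrative.
old_encoded = pd.DataFrame(
    {"industry_Health": [1, 0], "industry_IT": [0, 1], "age": [40, 35]}
)
new_encoded = pd.DataFrame({"age": [29, 51], "industry_Health": [0, 1]})

# Reindex onto the training columns: indicators missing from the new data
# are filled with 0, extra columns are dropped, and the order is matched.
new_aligned = new_encoded.reindex(columns=old_encoded.columns, fill_value=0)

print(new_aligned.columns.tolist())
```

If the shapes printed by the checks already agree and the column names match, this step changes nothing.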

In [None]:
old_customers1.shape

In [None]:
old_customers1.info()

In [None]:
new_customers1.shape

In [None]:
new_customers1.info()

## 3.0. Model building - RFM_loyalty_level<a id='model-building-rfm_loyalty_level'></a>

### 3.1. Training the Model with Old Data:<a id='training-the-model-with-old-data'></a>
The ML model will be trained on 75% of the old customers dataset and evaluated on the remaining 25% using a classification report (precision, recall, and F1 per class). The trained model will then be used to predict on the new customers dataset.
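
As a complement to a single train/test split, the two candidate classifiers could also be compared with cross-validation. The sketch below is only an illustration on synthetic data from `make_classification`; the sample size, feature counts, and the choice of macro-averaged F1 are assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded customer features and a 4-class label.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=10
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=10),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=10),
}

# Mean macro-averaged F1 over 5 stratified folds for each candidate model.
mean_f1 = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    for name, model in models.items()
}
print(mean_f1)
```

Averaging over folds makes the comparison less sensitive to one particular split than a single classification report.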

In [None]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(old_customers1,old_customers['RFM_loyalty_level'],test_size= 0.25, random_state=10,)

In [None]:
# decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)

# predict the labels for the test data
pred_labels_tree = tree.predict(test_features)

# create the classification report
from sklearn.metrics import classification_report
class_rep_tree = classification_report(test_labels, pred_labels_tree)

# view the performance of the model
print("Decision Tree: \n", class_rep_tree)

In [None]:
# random forest classifier
from sklearn.ensemble import RandomForestClassifier
rs = RandomForestClassifier()
rs.fit(train_features, train_labels)

# predict the labels for the test data
pred_labels_rs = rs.predict(test_features)

# create the classification report
class_rep_rs = classification_report(test_labels, pred_labels_rs)

# view the performance of the model
print("RandomForestClassifier: \n", class_rep_rs)

### 3.2. Testing the Model with New Data:<a id='testing-the-model-with-new-data'></a>
The decision tree model will be used to predict the RFM loyalty level of each customer in the new data.

In [None]:
# predict the new segments using decision tree model
output_label = tree.predict(new_customers1)

#The predicted array from the decision tree model will be concatenated onto the new customers dataset as a new dataframe column.

# convert an array into a dataframe column
new_customers['RFM_segments_predicted']=output_label.tolist()

# check final results
new_customers

## 4.0. Model building - RFMscore:<a id='model-building-rfmscore'></a>

### 4.1. Training the Model with Old Data:<a id='training-the-model-with-old-data-1'></a>
The ML model will be trained on 75% of the old customers dataset, this time with RFMscore as the target, and evaluated on the remaining 25% using a classification report. The trained model will then be used to predict on the new customers dataset.

In [None]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(old_customers1,old_customers['RFMscore'],test_size= 0.25, random_state=10,)

# decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)

# predict the labels for the test data
pred_labels_tree = tree.predict(test_features)

# create the classification report
from sklearn.metrics import classification_report
class_rep_tree = classification_report(test_labels, pred_labels_tree)

# view the performance of the model
print("Decision Tree: \n", class_rep_tree)

In [None]:
# random forest classifier
from sklearn.ensemble import RandomForestClassifier
rs = RandomForestClassifier()
rs.fit(train_features, train_labels)

# predict the labels for the test data
pred_labels_rs = rs.predict(test_features)

# create the classification report
class_rep_rs = classification_report(test_labels, pred_labels_rs)

# view the performance of the model
print("RandomForestClassifier: \n", class_rep_rs)

### 4.2. Testing the Model with New Data:<a id='testing-the-model-with-new-data-1'></a>
The decision tree model will be used to predict the RFM score of each customer in the new data.

In [None]:
# predict the new segments using decision tree model
output_label = tree.predict(new_customers1)

# convert an array into a dataframe column
new_customers['RFM_score']=output_label.tolist()

# check final results
new_customers

## 5.0. New Customers - Further Segmentation:<a id='new-customers-further-segmentation'></a>

In [None]:
new_customers['RFM_score'].value_counts()

In [None]:
customer_title = {3: 'Evasive Customer',
                  4: 'Almost Lost Customer',
                  5: 'High Risk Customer',
                  6: 'Losing Customer',
                  7: 'Late bloomer',
                  8: 'Potential Customer',
                  9: 'Recent Customer',
                  10: 'Becoming Loyal',
                  11: 'Very Loyal',
                  12: 'Platinum Customer'}
new_customers['customer_title'] = new_customers['RFM_score'].map(customer_title)
new_customers

In [None]:
customer_title_description = { 
                  'Evasive Customer':'Very low recency, Very low frequency, small amount spent',
                  'Almost Lost Customer':'Very low recency, low frequency, but high amount spent',
                  'High Risk Customer':'Purchase was long time ago, frequency is quite high, amount spent is high',
                  'Losing Customer':'Purchases was a while ago, below average RFM value',
                  'Late bloomer':'No purchases recently, but RFM value is larger than average',
                  'Potential Customer':'Bought recently, never bought before, spent small amount',
                  'Recent Customer':'Bought recently, not very often, average money spent',
                  'Becoming Loyal':'Relatively recent, bought more than once, spends large amount of money',
                  'Very Loyal':'Most recent, buys often, spends large amount of money',
                  'Platinum Customer':'Most recent buy, buys often, most spent'}
new_customers['customer_title_description'] = new_customers['customer_title'].map(customer_title_description)
new_customers

In [None]:
customer_rank = {
    'Platinum Customer': 1,
    'Very Loyal': 2,
    'Becoming Loyal': 3,
    'Recent Customer': 4,
    'Potential Customer': 5,
    'Late bloomer': 6,
    'Losing Customer': 7,
    'High Risk Customer': 8,
    'Almost Lost Customer': 9,
    'Evasive Customer': 10,
    'Last customer': 11
}
new_customers['customer_rank'] = new_customers['customer_title'].map(customer_rank)
new_customers

### 5.1. New Customers - Predicted Results:<a id='new-customers-predicted-results'></a>

In [None]:
new_customers

In [None]:
# count the occurrences of RFM_loyalty_level
import matplotlib.pyplot as plt

counts = new_customers['RFM_segments_predicted'].value_counts()
counts.plot(kind='bar',color='teal')
plt.title('new_customers_RFM_loyalty_level counts')
plt.xlabel('RFM_segments_predicted')
plt.ylabel('count')

# add percentages on top of the bars
for i, v in enumerate(counts):
    plt.text(i, v + 1, f'{(v/counts.sum()*100):.1f}%', ha='center')
    
plt.show()

**Graph observations:** Based on the graph, it can be inferred that the majority of the customer base, roughly 63%, consists of silver and bronze customers, while gold and platinum customers make up approximately 37% of the total customer base.

In [None]:
# count the occurrences of customer_title
counts = new_customers['customer_title'].value_counts()
counts.plot(kind='bar',color='teal')
plt.title('new_customers_customer_title counts')
plt.xlabel('customer_title')
plt.ylabel('count')

# add percentages on top of the bars
for i, v in enumerate(counts):
    plt.text(i, v + 1, f'{(v/counts.sum()*100):.1f}%', ha='center')
    
plt.show()

**Graph observations:** Based on the graph, it can be inferred that the poorly performing customer segments, including high risk, losing, almost lost, and evasive customers, account for roughly 38% of the customer base. The remaining 62% is made up of late bloomers, potential customers, recent customers, becoming loyal customers, very loyal customers, and platinum customers. This segregation is especially useful for determining which customers are at risk of leaving and which ones we should focus on for retention efforts.
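
The grouped share quoted above can be computed directly rather than read off the chart. This is a minimal sketch on an illustrative series; the `titles` values are made up for the example and are not the notebook's predictions.

```python
import pandas as pd

# Illustrative titles only; not the notebook's actual predictions.
titles = pd.Series(
    ["Very Loyal", "Losing Customer", "Recent Customer",
     "High Risk Customer", "Very Loyal"]
)

# Titles treated as poorly performing in the observation above.
at_risk = {"High Risk Customer", "Losing Customer",
           "Almost Lost Customer", "Evasive Customer"}

# Fraction of rows whose title falls in the at-risk group.
at_risk_share = titles.isin(at_risk).mean()
print(f"{at_risk_share:.1%}")  # 40.0%
```

Applied to `new_customers['customer_title']`, the same two lines would give the at-risk share for the predicted segments.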

In [None]:
# count the occurrences of RFM_segments_predicted
new_customers['RFM_segments_predicted'].value_counts()

In [None]:
# selecting customers with Platinum loyalty level 
platinum_customers = new_customers[new_customers['RFM_segments_predicted'] == 'Platinum']
platinum_customers

In [None]:
# selecting customers with gold loyalty level 
gold_customers = new_customers[new_customers['RFM_segments_predicted'] == 'Gold']
gold_customers

In [None]:
# exporting our new_customers dataframe (index=False avoids writing an extra 'Unnamed: 0' column)
new_customers.to_csv('most_valued_customers.csv', index=False)

----------------------------------------------------------------------------------------------------------------------

Cheers,    
Vamsi Krishna Kamatham