<a href="https://colab.research.google.com/github/sanaa-04/CSAT_ScorePrediction/blob/main/CSAT_Score_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Overview of Project**
This project focuses on predicting Customer Satisfaction (CSAT) scores using Deep Learning Artificial Neural Networks (ANN). In the context of e-commerce, understanding customer satisfaction through their interactions and feedback is crucial for enhancing service quality, customer retention, and overall business growth. By leveraging advanced neural network models, we aim to accurately forecast CSAT scores based on a myriad of interaction-related features, providing actionable insights for service improvement.

## **About Dataset**
The dataset captures customer satisfaction scores for a one-month period at an e-commerce platform called Shopzilla (a pseudonym). It includes features such as the category and sub-category of each interaction, customer remarks, survey response date, product category, item price, agent details (name, supervisor, manager), and the CSAT score.

## **Project Background**
Customer satisfaction in the e-commerce sector is a pivotal metric that influences loyalty, repeat business, and word-of-mouth marketing. Traditionally, companies have relied on direct surveys to gauge customer satisfaction, which can be time-consuming and may not always capture the full spectrum of customer experiences. With the advent of deep learning, it's now possible to predict customer satisfaction scores in real-time, offering a granular view of service performance and identifying areas for immediate improvement.

## **Data Description:**
Unique id: Unique identifier for each record

Channel name: Name of the customer service channel

Category: Category of the interaction

Sub-category: Sub-category of the interaction

Customer Remarks: Feedback provided by the customer

Order id: Identifier for the order associated with the interaction

Order date time: Date and time of the order

Issue reported at: Timestamp when the issue was reported

Issue responded: Timestamp when the issue was responded to

Survey response date: Date of the customer survey response

Customer city: City of the customer

Product category: Category of the product

Item price: Price of the item

Connected handling time: Time taken to handle the interaction

Agent name: Name of the customer service agent

Supervisor: Name of the supervisor

Manager: Name of the manager

Tenure Bucket: Bucket categorizing agent tenure

Agent Shift: Shift timing of the agent

CSAT Score: Customer Satisfaction (CSAT) score

## **Project Goal**
The primary goal of this project is to develop a deep learning model that can accurately predict the CSAT scores based on customer interactions and feedback. By doing so, we aim to provide e-commerce businesses with a powerful tool to monitor and enhance customer satisfaction in real-time, thereby improving service quality and fostering customer loyalty.

In [34]:
#Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import zipfile

from sklearn.preprocessing import FunctionTransformer
import scipy.stats as stats

In [35]:
#Extract, read data and Create data frame
df = pd.read_csv("/content/drive/MyDrive/eCommerce_Customer_support_data.csv")

In [7]:
df.head()



Unnamed: 0,Unique id,channel_name,category,Sub-category,Customer Remarks,Order_id,order_date_time,Issue_reported at,issue_responded,Survey_response_Date,Customer_City,Product_category,Item_price,connected_handling_time,Agent_name,Supervisor,Manager,Tenure Bucket,Agent Shift,CSAT Score
0,7e9ae164-6a8b-4521-a2d4-58f7c9fff13f,Outcall,Product Queries,Life Insurance,,c27c9bb4-fa36-4140-9f1f-21009254ffdb,,01/08/2023 11:13,01/08/2023 11:47,01-Aug-23,,,,,Richard Buchanan,Mason Gupta,Jennifer Nguyen,On Job Training,Morning,5
1,b07ec1b0-f376-43b6-86df-ec03da3b2e16,Outcall,Product Queries,Product Specific Information,,d406b0c7-ce17-4654-b9de-f08d421254bd,,01/08/2023 12:52,01/08/2023 12:54,01-Aug-23,,,,,Vicki Collins,Dylan Kim,Michael Lee,>90,Morning,5
2,200814dd-27c7-4149-ba2b-bd3af3092880,Inbound,Order Related,Installation/demo,,c273368d-b961-44cb-beaf-62d6fd6c00d5,,01/08/2023 20:16,01/08/2023 20:38,01-Aug-23,,,,,Duane Norman,Jackson Park,William Kim,On Job Training,Evening,5
3,eb0d3e53-c1ca-42d3-8486-e42c8d622135,Inbound,Returns,Reverse Pickup Enquiry,,5aed0059-55a4-4ec6-bb54-97942092020a,,01/08/2023 20:56,01/08/2023 21:16,01-Aug-23,,,,,Patrick Flores,Olivia Wang,John Smith,>90,Evening,5
4,ba903143-1e54-406c-b969-46c52f92e5df,Inbound,Cancellation,Not Needed,,e8bed5a9-6933-4aff-9dc6-ccefd7dcde59,,01/08/2023 10:30,01/08/2023 10:32,01-Aug-23,,,,,Christopher Sanchez,Austin Johnson,Michael Lee,0-30,Morning,5


In [82]:
print("This data set has",df.shape[0],"rows and",df.shape[1],"columns.")

This data set has 85907 rows and 20 columns.


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85907 entries, 0 to 85906
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unique id                85907 non-null  object 
 1   channel_name             85907 non-null  object 
 2   category                 85907 non-null  object 
 3   Sub-category             85907 non-null  object 
 4   Customer Remarks         28742 non-null  object 
 5   Order_id                 67675 non-null  object 
 6   order_date_time          17214 non-null  object 
 7   Issue_reported at        85907 non-null  object 
 8   issue_responded          85907 non-null  object 
 9   Survey_response_Date     85907 non-null  object 
 10  Customer_City            17079 non-null  object 
 11  Product_category         17196 non-null  object 
 12  Item_price               17206 non-null  float64
 13  connected_handling_time  242 non-null    float64
 14  Agent_name            

In [37]:
df.dtypes

Unnamed: 0,0
Unique id,object
channel_name,object
category,object
Sub-category,object
Customer Remarks,object
Order_id,object
order_date_time,object
Issue_reported at,object
issue_responded,object
Survey_response_Date,object


In [83]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Item_price,17206.0,5660.774846,12825.728411,0.0,392.0,979.0,2699.75,164999.0
connected_handling_time,242.0,462.400826,246.295037,0.0,293.0,427.0,592.25,1986.0
CSAT Score,85907.0,4.242157,1.378903,1.0,4.0,5.0,5.0,5.0


`describe()` reports summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for numerical columns only, which is why just three variables appear.

* Item_price and connected_handling_time are right-skewed, not left-skewed: in both, the mean exceeds the median (5660.77 vs. 979.0 and 462.40 vs. 427.0), and the maximum lies far above the 75th percentile. Both appear to contain many outliers, which a visualization would show more clearly.

* CSAT Score has a numerical dtype but behaves like a categorical variable, taking integer values from 1 to 5.
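
One way to make the outlier observation concrete is Tukey's 1.5 × IQR rule, applied to the quartiles reported by `describe()` above. This is a minimal sketch: the helper name `iqr_fences` is an illustrative assumption, and the 1.5 multiplier is a common convention rather than something this notebook specifies.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maxima copied from the describe() output above
summary = {
    "Item_price": {"q1": 392.0, "q3": 2699.75, "max": 164999.0},
    "connected_handling_time": {"q1": 293.0, "q3": 592.25, "max": 1986.0},
}

fences = {name: iqr_fences(s["q1"], s["q3"]) for name, s in summary.items()}

# A maximum above the upper fence means at least one high outlier under this rule
has_high_outliers = {name: summary[name]["max"] > fences[name][1] for name in summary}

for name, (low, high) in fences.items():
    print(f"{name}: fences=({low}, {high}), max above fence: {has_high_outliers[name]}")
```

Under this rule both columns have maxima well beyond the upper fence, which is consistent with a long right tail.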

In [40]:
df.duplicated().sum()

np.int64(0)

In [41]:
df.isna().sum()

Unnamed: 0,0
Unique id,0
channel_name,0
category,0
Sub-category,0
Customer Remarks,57165
Order_id,18232
order_date_time,68693
Issue_reported at,0
issue_responded,0
Survey_response_Date,0


## **Data Cleaning**

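Several columns contain missing values (counts taken from the `df.info()` and `df.isna().sum()` outputs above, out of 85,907 rows):

1. "Customer Remarks": 57,165 missing
2. "Order_id": 18,232 missing
3. "order_date_time": 68,693 missing
4. "Customer_City": 68,828 missing
5. "Product_category": 68,711 missing
6. "Item_price": 68,701 missing
7. "connected_handling_time": 85,665 missing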
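
The missing counts follow directly from the non-null counts in `df.info()`. A short sketch of that arithmetic, using only numbers printed above (the dictionary and variable names are illustrative):

```python
TOTAL_ROWS = 85907  # from df.shape

# Non-null counts copied from the df.info() output above
non_null = {
    "Customer Remarks": 28742,
    "Order_id": 67675,
    "order_date_time": 17214,
    "Customer_City": 17079,
    "Product_category": 17196,
    "Item_price": 17206,
    "connected_handling_time": 242,
}

# Missing = total rows minus non-null rows, plus the share of rows affected
missing = {col: TOTAL_ROWS - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / TOTAL_ROWS, 1) for col, m in missing.items()}

for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:<25} {missing[col]:>6} missing ({missing_pct[col]}%)")
```

On the live DataFrame, `df.isna().mean()` gives the same fractions directly.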

In [42]:
# Drop columns that are not used for modelling
df_cleaned = df.drop(["Customer Remarks", "Order_id", "order_date_time"], axis=1)

# Impute missing values for numerical features with the column median
# (assigning the result back avoids chained inplace fillna, which newer pandas versions warn about)
for column in ["Item_price", "connected_handling_time"]:
    df_cleaned[column] = df_cleaned[column].fillna(df_cleaned[column].median())

# Impute missing values for categorical features with an explicit "Unknown" label
for column in ["Customer_City", "Product_category"]:
    df_cleaned[column] = df_cleaned[column].fillna("Unknown")

# Convert timestamp columns to datetime format; unparseable values become NaT
# NOTE: the raw values look day-first (e.g. "01/08/2023 11:13" alongside "01-Aug-23"),
# so passing dayfirst=True or an explicit format may be needed to parse them as intended.
timestamp_columns = ["Issue_reported at", "issue_responded", "Survey_response_Date"]
for column in timestamp_columns:
    df_cleaned[column] = pd.to_datetime(df_cleaned[column], errors="coerce")

# Impute missing values for timestamp features with the column median
for column in ["Issue_reported at", "issue_responded"]:
    df_cleaned[column] = df_cleaned[column].fillna(df_cleaned[column].median())
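
The raw timestamps shown in `df.head()` (for example `01/08/2023 11:13` next to a survey date of `01-Aug-23`) look day-first. The sketch below assumes that reading and the format string `%d/%m/%Y %H:%M`, and shows how an explicit format removes the ambiguity. It is an illustration, not the parsing this notebook actually ran.

```python
import pandas as pd

# Sample values copied from the first rows of df.head()
raw = pd.Series(["01/08/2023 11:13", "01/08/2023 12:52", "01/08/2023 20:16"])

# Explicit day-first format (an assumption based on the survey date "01-Aug-23")
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")

print(parsed.dt.month.tolist())  # month component under the day-first reading
print(int(parsed.isna().sum()))  # number of rows that failed to parse
```

Without an explicit format, a value such as `13/08/2023` cannot be read month-first; depending on the pandas version it may be reinterpreted or, with `errors="coerce"`, become `NaT` and then be median-imputed.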

In [43]:
df_cleaned.isna().sum()

Unnamed: 0,0
Unique id,0
channel_name,0
category,0
Sub-category,0
Issue_reported at,0
issue_responded,0
Survey_response_Date,0
Customer_City,0
Product_category,0
Item_price,0


In [44]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85907 entries, 0 to 85906
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Unique id                85907 non-null  object        
 1   channel_name             85907 non-null  object        
 2   category                 85907 non-null  object        
 3   Sub-category             85907 non-null  object        
 4   Issue_reported at        85907 non-null  datetime64[ns]
 5   issue_responded          85907 non-null  datetime64[ns]
 6   Survey_response_Date     85907 non-null  datetime64[ns]
 7   Customer_City            85907 non-null  object        
 8   Product_category         85907 non-null  object        
 9   Item_price               85907 non-null  float64       
 10  connected_handling_time  85907 non-null  float64       
 11  Agent_name               85907 non-null  object        
 12  Supervisor               85907 n

In [45]:
# Create a new column with the time between 'Issue_reported at' and 'issue_responded', in minutes

df_cleaned['issue_responded_in_minutes'] = (df_cleaned['issue_responded'] - df_cleaned['Issue_reported at']).dt.total_seconds() / 60
df_cleaned['issue_responded_in_minutes'] = df_cleaned['issue_responded_in_minutes'].astype(int)
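
As a small check of the response-time calculation, the sketch below applies the same subtraction to the first three rows shown in `df.head()`. The day-first format string and the variable names are assumptions made for this illustration.

```python
import pandas as pd

FMT = "%d/%m/%Y %H:%M"  # assumed day-first layout of the raw timestamps

# Timestamps copied from rows 0-2 of df.head()
reported = pd.to_datetime(
    pd.Series(["01/08/2023 11:13", "01/08/2023 12:52", "01/08/2023 20:16"]), format=FMT
)
responded = pd.to_datetime(
    pd.Series(["01/08/2023 11:47", "01/08/2023 12:54", "01/08/2023 20:38"]), format=FMT
)

# Same computation as the cell above: timedelta -> seconds -> whole minutes
minutes = ((responded - reported).dt.total_seconds() / 60).astype(int)
negative_rows = int((minutes < 0).sum())

print(minutes.tolist(), negative_rows)
```

Note that `astype(int)` truncates toward zero, and that a non-zero `negative_rows` on the full data would point to reversed or mis-parsed timestamps.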

In [29]:
df_cleaned['issue_responded_in_minutes']

Unnamed: 0,issue_responded_in_minutes
0,34
1,2
2,22
3,20
4,2
...,...
85902,12
85903,12
85904,12
85905,12


In [46]:
# Drop the 'Issue_reported at' and 'issue_responded' columns

df_cleaned = df_cleaned.drop(["Issue_reported at", "issue_responded"], axis=1)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85907 entries, 0 to 85906
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Unique id                   85907 non-null  object        
 1   channel_name                85907 non-null  object        
 2   category                    85907 non-null  object        
 3   Sub-category                85907 non-null  object        
 4   Survey_response_Date        85907 non-null  datetime64[ns]
 5   Customer_City               85907 non-null  object        
 6   Product_category            85907 non-null  object        
 7   Item_price                  85907 non-null  float64       
 8   connected_handling_time     85907 non-null  float64       
 9   Agent_name                  85907 non-null  object        
 10  Supervisor                  85907 non-null  object        
 11  Manager                     85907 non-null  object    

In [47]:
df_cleaned.columns

Index(['Unique id', 'channel_name', 'category', 'Sub-category',
       'Survey_response_Date', 'Customer_City', 'Product_category',
       'Item_price', 'connected_handling_time', 'Agent_name', 'Supervisor',
       'Manager', 'Tenure Bucket', 'Agent Shift', 'CSAT Score',
       'issue_responded_in_minutes'],
      dtype='object')

In [48]:
df_cleaned.head()

Unnamed: 0,Unique id,channel_name,category,Sub-category,Survey_response_Date,Customer_City,Product_category,Item_price,connected_handling_time,Agent_name,Supervisor,Manager,Tenure Bucket,Agent Shift,CSAT Score,issue_responded_in_minutes
0,7e9ae164-6a8b-4521-a2d4-58f7c9fff13f,Outcall,Product Queries,Life Insurance,2023-08-01,Unknown,Unknown,979.0,427.0,Richard Buchanan,Mason Gupta,Jennifer Nguyen,On Job Training,Morning,5,34
1,b07ec1b0-f376-43b6-86df-ec03da3b2e16,Outcall,Product Queries,Product Specific Information,2023-08-01,Unknown,Unknown,979.0,427.0,Vicki Collins,Dylan Kim,Michael Lee,>90,Morning,5,2
2,200814dd-27c7-4149-ba2b-bd3af3092880,Inbound,Order Related,Installation/demo,2023-08-01,Unknown,Unknown,979.0,427.0,Duane Norman,Jackson Park,William Kim,On Job Training,Evening,5,22
3,eb0d3e53-c1ca-42d3-8486-e42c8d622135,Inbound,Returns,Reverse Pickup Enquiry,2023-08-01,Unknown,Unknown,979.0,427.0,Patrick Flores,Olivia Wang,John Smith,>90,Evening,5,20
4,ba903143-1e54-406c-b969-46c52f92e5df,Inbound,Cancellation,Not Needed,2023-08-01,Unknown,Unknown,979.0,427.0,Christopher Sanchez,Austin Johnson,Michael Lee,0-30,Morning,5,2


In [51]:
# Create new_data with the channel name, response time in minutes, item price, tenure bucket, and CSAT score
# (.copy() makes new_data independent of df_cleaned, so later column assignments do not trigger SettingWithCopyWarning)

new_data = df_cleaned[['channel_name', 'issue_responded_in_minutes', 'Item_price', 'Tenure Bucket', 'CSAT Score']].copy()
new_data.head()
new_data.info()
new_data.dtypes
new_data.describe()
new_data.isna().sum()
new_data.duplicated().sum()
new_data.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85907 entries, 0 to 85906
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   channel_name                85907 non-null  object 
 1   issue_responded_in_minutes  85907 non-null  int64  
 2   Item_price                  85907 non-null  float64
 3   Tenure Bucket               85907 non-null  object 
 4   CSAT Score                  85907 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 3.3+ MB


Index(['channel_name', 'issue_responded_in_minutes', 'Item_price',
       'Tenure Bucket', 'CSAT Score'],
      dtype='object')

In [52]:
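# Scale both numerical columns to the [0, 1] range
# NOTE: the scaler is fitted on the full dataset here, before the train/test split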

from sklearn.preprocessing import MinMaxScaler

numerical_cols = ['issue_responded_in_minutes', 'Item_price']
scaler = MinMaxScaler()
new_data[numerical_cols] = scaler.fit_transform(new_data[numerical_cols])

new_data.head()


Unnamed: 0,channel_name,issue_responded_in_minutes,Item_price,Tenure Bucket,CSAT Score
0,Outcall,0.596426,0.005933,On Job Training,5
1,Outcall,0.596353,0.005933,>90,5
2,Inbound,0.596398,0.005933,On Job Training,5
3,Inbound,0.596394,0.005933,>90,5
4,Inbound,0.596353,0.005933,0-30,5
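
Min-max scaling maps each value to (x - min) / (max - min). The sketch below reproduces the `Item_price` value shown above from the statistics reported by `describe()` (minimum 0.0, maximum 164999.0, imputed median 979.0). The helper name `min_max` is illustrative.

```python
def min_max(x, lo, hi):
    """Scale x into [0, 1] given the fitted minimum and maximum."""
    return (x - lo) / (hi - lo)

item_price_min, item_price_max = 0.0, 164999.0  # from df.describe()
median_price = 979.0                            # value used to impute missing prices

scaled = min_max(median_price, item_price_min, item_price_max)
print(round(scaled, 6))
```

Because most prices were imputed with the median, most rows share this scaled value. To keep information from the test rows out of the scaler, it could instead be fitted on the training split only and then applied to both splits with `transform`.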


In [53]:
new_data = pd.get_dummies(new_data,columns=["channel_name","Tenure Bucket"],drop_first=True)


In [54]:
new_data.head()

Unnamed: 0,issue_responded_in_minutes,Item_price,CSAT Score,channel_name_Inbound,channel_name_Outcall,Tenure Bucket_31-60,Tenure Bucket_61-90,Tenure Bucket_>90,Tenure Bucket_On Job Training
0,0.596426,0.005933,5,False,True,False,False,False,True
1,0.596353,0.005933,5,False,True,False,False,True,False
2,0.596398,0.005933,5,True,False,False,False,False,True
3,0.596394,0.005933,5,True,False,False,False,True,False
4,0.596353,0.005933,5,True,False,False,False,False,False
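
With `drop_first=True`, the first category in sorted order becomes the baseline and is represented by all zeros. A minimal sketch using the tenure labels that appear in this notebook; `dtype=int` is an optional argument (not used in the cell above) that yields 0/1 columns directly.

```python
import pandas as pd

# One row per tenure label seen in this notebook
tenure = pd.DataFrame(
    {"Tenure Bucket": ["0-30", "31-60", "61-90", ">90", "On Job Training"]}
)

# '0-30' sorts first, so it is dropped and becomes the all-zeros baseline
encoded = pd.get_dummies(tenure, columns=["Tenure Bucket"], drop_first=True, dtype=int)
print(encoded)
```

The row for `0-30` contains only zeros, which is why no `Tenure Bucket_0-30` column appears in the output above.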


In [55]:
# Convert the True/False dummy columns in new_data to 1 and 0

columns_to_convert = ['channel_name_Inbound', 'channel_name_Outcall', 'Tenure Bucket_31-60', 'Tenure Bucket_61-90', 'Tenure Bucket_>90', 'Tenure Bucket_On Job Training']
for col in columns_to_convert:
    new_data[col] = new_data[col].astype(int)

new_data.head()
new_data.dtypes

Unnamed: 0,0
issue_responded_in_minutes,float64
Item_price,float64
CSAT Score,int64
channel_name_Inbound,int64
channel_name_Outcall,int64
Tenure Bucket_31-60,int64
Tenure Bucket_61-90,int64
Tenure Bucket_>90,int64
Tenure Bucket_On Job Training,int64


In [56]:
new_data.head()

Unnamed: 0,issue_responded_in_minutes,Item_price,CSAT Score,channel_name_Inbound,channel_name_Outcall,Tenure Bucket_31-60,Tenure Bucket_61-90,Tenure Bucket_>90,Tenure Bucket_On Job Training
0,0.596426,0.005933,5,0,1,0,0,0,1
1,0.596353,0.005933,5,0,1,0,0,1,0
2,0.596398,0.005933,5,1,0,0,0,0,1
3,0.596394,0.005933,5,1,0,0,0,1,0
4,0.596353,0.005933,5,1,0,0,0,0,0


## **Model Training**

In [57]:
import tensorflow
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,Flatten

In [58]:
X = new_data.drop(columns=['CSAT Score'])
y = new_data['CSAT Score'].values

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [73]:
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

model = Sequential()

# Five hidden Dense layers of 128 units, each followed by Leaky ReLU.
# The first layer also declares the input shape (one value per feature column).
model.add(Dense(128, input_shape=(X_train.shape[1],)))
model.add(LeakyReLU(alpha=0.01))
model.add(Dense(128))
model.add(LeakyReLU(alpha=0.01))
model.add(Dense(128))
model.add(LeakyReLU(alpha=0.01))
model.add(Dense(128))
model.add(LeakyReLU(alpha=0.01))
model.add(Dense(128))
model.add(LeakyReLU(alpha=0.01))

# Output layer with softmax activation.
# CSAT scores run from 1 to 5, so 6 units let the integer labels be used
# directly as class indices (index 0 is never a target).
model.add(Dense(6, activation='softmax'))

# Summary of the model
model.summary()
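
`sparse_categorical_crossentropy` expects integer class indices starting at 0, which is why six output units are needed to accept the labels 1 to 5 unchanged. An alternative, sketched here with NumPy only and with illustrative values, is to shift the labels to 0..4 (so that five output units would suffice) and shift predictions back afterwards. This is an option, not what the notebook does.

```python
import numpy as np

# Example CSAT scores (illustrative values, not taken from the dataset)
y = np.array([5, 5, 1, 4, 3, 2])
y_zero_based = y - 1  # labels now span 0..4

# Hypothetical softmax outputs of a 5-unit output layer, one row per sample
probs = np.array([
    [0.05, 0.05, 0.10, 0.20, 0.60],
    [0.70, 0.10, 0.10, 0.05, 0.05],
])

# argmax gives a 0-based class index; add 1 to recover the CSAT score
predicted_scores = probs.argmax(axis=1) + 1
print(y_zero_based.tolist(), predicted_scores.tolist())
```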

In [74]:
model.compile(loss='sparse_categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])

In [75]:
# An epoch is one full pass of the training data through the model; epochs=25 requests 25 such passes
history = model.fit(X_train,y_train,epochs=25,validation_split=0.2)


Epoch 1/25
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 12s 5ms/step - accuracy: 0.6891 - loss: 0.9905 - val_accuracy: 0.6957 - val_loss: 0.9479
Epoch 2/25
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.6931 - loss: 0.9549 - val_accuracy: 0.6957 - val_loss: 0.9549
Epoch 3/25
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.6947 - loss: 0.9470 - val_accuracy: 0.6952 - val_loss: 0.9492
Epoch 4/25
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - accuracy: 0.6919 - loss: 0.9531 - val_accuracy: 0.6954 - val_loss: 0.9451
Epoch 5/25
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - accuracy: 0.6920 - loss: 0.9553 - val_accuracy: 0.6957 - val_loss: 0.9455
Epoch 6/25
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.6918 - loss: 0.9542 - val_accuracy: 0.6957 - val_loss: 0.9440
Epoch 7/25
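
The `1719/1719` counter in the log is the number of batches per epoch. A sketch of where it comes from, assuming Keras's default batch size of 32 (the `fit` call does not set one) and that `validation_split=0.2` holds out one fifth of the training rows:

```python
import math

total_rows = 85907                     # from df.shape
n_test = math.ceil(total_rows * 0.2)   # train_test_split rounds the test size up
n_train = total_rows - n_test

n_fit = n_train * 4 // 5               # rows left for fitting after validation_split=0.2
batch_size = 32                        # Keras default when batch_size is not given

steps_per_epoch = math.ceil(n_fit / batch_size)
predict_steps = math.ceil(n_test / batch_size)

print(n_train, n_fit, steps_per_epoch, predict_steps)
```

The second figure corresponds to the `537/537` counter printed by `model.predict(X_test)`.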


In [76]:
y_prob = model.predict(X_test)

537/537 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step


In [77]:
y_pred = y_prob.argmax(axis=1)

In [78]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6911302525899197
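
Validation accuracy stays near 0.6957 across the logged epochs, which could mean the network is mostly predicting a single score. A quick check is to compare the model against a majority-class baseline and to look at the distribution of its predictions. The sketch below uses small made-up arrays in place of `y_test` and `y_pred`, and the helper name is illustrative.

```python
import numpy as np

def majority_baseline(y_true):
    """Return (most frequent label, accuracy of always predicting it)."""
    values, counts = np.unique(y_true, return_counts=True)
    return int(values[counts.argmax()]), counts.max() / len(y_true)

# Made-up stand-ins for y_test and y_pred (not results from this notebook)
y_true_demo = np.array([5, 5, 5, 4, 1, 5, 3, 5, 5, 1])
y_pred_demo = np.array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

majority_class, baseline_acc = majority_baseline(y_true_demo)
model_acc = float((y_true_demo == y_pred_demo).mean())

# How often each score is predicted; a single key means a constant predictor
pred_values, pred_counts = np.unique(y_pred_demo, return_counts=True)
pred_distribution = dict(zip(pred_values.tolist(), pred_counts.tolist()))

print(majority_class, baseline_acc, model_acc, pred_distribution)
```

If the model's accuracy on the real data is no higher than this baseline, metrics such as a confusion matrix or per-class recall would say more than accuracy alone.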

## **Gradio Deployment**

In [79]:
!pip3 install gradio



In [80]:
# defining prediction function


def predict_csat(channel_name, issue_responded_in_minutes, Item_price, Tenure_Bucket):
  # Scale numerical inputs using the pre-fitted scaler
  numerical_input = np.array([[issue_responded_in_minutes, Item_price]])
  scaled_numerical_input = scaler.transform(numerical_input)

  # Create a dictionary for the input and handle one-hot encoding for categorical features
  input_data = {
      'channel_name_Inbound': 0,
      'channel_name_Outcall': 0,
      'Tenure Bucket_31-60': 0,
      'Tenure Bucket_61-90': 0,
      'Tenure Bucket_>90': 0,
      'Tenure Bucket_On Job Training': 0,
      'issue_responded_in_minutes': scaled_numerical_input[0][0],
      'Item_price': scaled_numerical_input[0][1],
  }

  if channel_name == 'Inbound':
      input_data['channel_name_Inbound'] = 1
  elif channel_name == 'Outcall':
      input_data['channel_name_Outcall'] = 1

  if Tenure_Bucket == '31-60':
      input_data['Tenure Bucket_31-60'] = 1
  elif Tenure_Bucket == '61-90':
      input_data['Tenure Bucket_61-90'] = 1
  elif Tenure_Bucket == '>90':
      input_data['Tenure Bucket_>90'] = 1
  elif Tenure_Bucket == 'On Job Training':
      input_data['Tenure Bucket_On Job Training'] = 1

  # Convert the input dictionary to a pandas DataFrame with the same columns as the training data
  input_df = pd.DataFrame([input_data])
  input_df = input_df[X_train.columns] # Ensure the columns are in the same order

  # Make the prediction
  prediction = model.predict(input_df)
  predicted_class = np.argmax(prediction, axis=1)[0]

  return f"Predicted CSAT Score: {predicted_class}"
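
The manual if/elif encoding can also be expressed with `get_dummies` followed by `reindex`, which fills any column absent from the single-row input with 0. This is a sketch under the assumption that `train_columns` mirrors `X_train.columns` (written out here from the `new_data.head()` output so the block is self-contained). `encode_row` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Assumed to match X_train.columns (taken from the new_data.head() output)
train_columns = [
    "issue_responded_in_minutes", "Item_price",
    "channel_name_Inbound", "channel_name_Outcall",
    "Tenure Bucket_31-60", "Tenure Bucket_61-90",
    "Tenure Bucket_>90", "Tenure Bucket_On Job Training",
]

def encode_row(channel_name, minutes_scaled, price_scaled, tenure_bucket):
    """Build a one-row frame with the same columns and order as the training data."""
    row = pd.DataFrame([{
        "issue_responded_in_minutes": minutes_scaled,
        "Item_price": price_scaled,
        "channel_name": channel_name,
        "Tenure Bucket": tenure_bucket,
    }])
    dummies = pd.get_dummies(row, columns=["channel_name", "Tenure Bucket"], dtype=int)
    # Columns missing from this row become 0; columns unknown to training are dropped
    return dummies.reindex(columns=train_columns, fill_value=0)

encoded = encode_row("Outcall", 0.596353, 0.005933, ">90")
baseline = encode_row("Inbound", 0.596353, 0.005933, "0-30")
print(encoded.iloc[0].to_dict())
```

One consequence of this approach, shared with the if/elif version, is that an unrecognized category is silently treated as the baseline, so validating inputs before encoding may be worthwhile.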

In [81]:
# Creating a gradio interface

import gradio as gr

iface = gr.Interface(
    fn=predict_csat,
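    inputs=[
        # 'Outbound' has no matching dummy column, so predict_csat treats it like the baseline channel (all channel flags 0)
        gr.Dropdown(choices=['Inbound', 'Outbound', 'Outcall'], label="Channel Name"),
        gr.Number(label="Issue Responded in Minutes"),
        gr.Number(label="Item Price"),
        # '0-30' matches the bucket label in the data; it is the baseline bucket (all tenure flags 0)
        gr.Dropdown(choices=['0-30', '31-60', '61-90', '>90', 'On Job Training'], label="Tenure Bucket")
    ],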
    outputs="text",
    title="Customer Satisfaction Score Prediction",
    description="Predict the CSAT score based on customer interaction details."
)

iface.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://f2f2c170efecc4d263.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


