# **Cleaning and Preparation of Data**

This dataset is derived from the Amazon Customer Behavior Survey. It contains responses from 602 individuals across 23 columns that capture various aspects of their shopping experience. It focuses on customer preferences, behaviors, and satisfaction levels at different touchpoints in the shopping process.

Most of the data is categorical, which suggests that respondents chose from fixed options (e.g., rating scales, frequencies). A few columns are numerical, such as age and the rating columns, and some responses are free text or selection-based (e.g., improvement areas). The dataset can be used to study trends in customer behavior, identify patterns in shopping frequency, and explore how personalized recommendations and reviews influence purchase decisions.



## Cleaning techniques

In this data cleaning process, regular expressions were used to standardize variations of the word "interface" into "User Interface". Additional techniques included stripping whitespace so that near-duplicate categories merge, replacing the placeholder "." with NaN, creating a unique `user_id` for each row, and selecting relevant columns to form two specialized dataframes (`users_df` and `user_behaviour_df`).
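As a minimal sketch of the regex step, the snippet below applies the same kind of replacement to a small made-up Series. The `toy` values are illustrative and not taken from the survey, and the inline flag `(?i)` makes the match case-insensitive, which is an assumption beyond the notebook's own case-sensitive pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical responses, for illustration only
toy = pd.Series(["User interface", "UI", "Better app interface", "Shipping speed", "."])

# Treat the placeholder "." as missing (exact match, not a regex)
cleaned = toy.replace(".", np.nan)

# Map the abbreviation first, then any entry containing "interface" (any case)
cleaned = cleaned.replace(["UI"], "User Interface")
cleaned = cleaned.replace(to_replace=r"(?i).*interface.*", value="User Interface", regex=True)

print(cleaned.tolist())
```

Entries that do not contain the word, such as "Shipping speed", pass through unchanged, and missing values are left as they are.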

In [133]:
# Importing Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [134]:
# Importing Files
df = pd.read_csv('../source/Amazon Customer Behavior Survey.csv')

In [135]:
df.head()

Unnamed: 0,Timestamp,age,Gender,Purchase_Frequency,Purchase_Categories,Personalized_Recommendation_Frequency,Browsing_Frequency,Product_Search_Method,Search_Result_Exploration,Customer_Reviews_Importance,...,Saveforlater_Frequency,Review_Left,Review_Reliability,Review_Helpfulness,Personalized_Recommendation_Frequency.1,Recommendation_Helpfulness,Rating_Accuracy,Shopping_Satisfaction,Service_Appreciation,Improvement_Areas
0,2023/06/04 1:28:19 PM GMT+5:30,23,Female,Few times a month,Beauty and Personal Care,Yes,Few times a week,Keyword,Multiple pages,1,...,Sometimes,Yes,Occasionally,Yes,2,Yes,1,1,Competitive prices,Reducing packaging waste
1,2023/06/04 2:30:44 PM GMT+5:30,23,Female,Once a month,Clothing and Fashion,Yes,Few times a month,Keyword,Multiple pages,1,...,Rarely,No,Heavily,Yes,2,Sometimes,3,2,Wide product selection,Reducing packaging waste
2,2023/06/04 5:04:56 PM GMT+5:30,24,Prefer not to say,Few times a month,Groceries and Gourmet Food;Clothing and Fashion,No,Few times a month,Keyword,Multiple pages,2,...,Rarely,No,Occasionally,No,4,No,3,3,Competitive prices,Product quality and accuracy
3,2023/06/04 5:13:00 PM GMT+5:30,24,Female,Once a month,Beauty and Personal Care;Clothing and Fashion;...,Sometimes,Few times a month,Keyword,First page,5,...,Sometimes,Yes,Heavily,Yes,3,Sometimes,3,4,Competitive prices,Product quality and accuracy
4,2023/06/04 5:28:06 PM GMT+5:30,22,Female,Less than once a month,Beauty and Personal Care;Clothing and Fashion,Yes,Few times a month,Filter,Multiple pages,1,...,Rarely,No,Heavily,Yes,4,Yes,2,2,Competitive prices,Product quality and accuracy


In [136]:
#Checking Column Names

df.columns

Index(['Timestamp', 'age', 'Gender', 'Purchase_Frequency',
       'Purchase_Categories', 'Personalized_Recommendation_Frequency',
       'Browsing_Frequency', 'Product_Search_Method',
       'Search_Result_Exploration', 'Customer_Reviews_Importance',
       'Add_to_Cart_Browsing', 'Cart_Completion_Frequency',
       'Cart_Abandonment_Factors', 'Saveforlater_Frequency', 'Review_Left',
       'Review_Reliability', 'Review_Helpfulness',
       'Personalized_Recommendation_Frequency ', 'Recommendation_Helpfulness',
       'Rating_Accuracy ', 'Shopping_Satisfaction', 'Service_Appreciation',
       'Improvement_Areas'],
      dtype='object')

In [137]:
#Checking Number of Rows and Columns

df.shape

(602, 23)

In [138]:
# Check dtypes and non-null counts per column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602 entries, 0 to 601
Data columns (total 23 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Timestamp                               602 non-null    object
 1   age                                     602 non-null    int64 
 2   Gender                                  602 non-null    object
 3   Purchase_Frequency                      602 non-null    object
 4   Purchase_Categories                     602 non-null    object
 5   Personalized_Recommendation_Frequency   602 non-null    object
 6   Browsing_Frequency                      602 non-null    object
 7   Product_Search_Method                   600 non-null    object
 8   Search_Result_Exploration               602 non-null    object
 9   Customer_Reviews_Importance             602 non-null    int64 
 10  Add_to_Cart_Browsing                    602 non-null    object
 11  Cart_C
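
The `info()` output lists 600 non-null values for `Product_Search_Method` against 602 rows. A per-column count of missing values can make such gaps easier to spot. The sketch below uses a small hypothetical frame (`toy`) rather than the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the survey data
toy = pd.DataFrame({
    "Product_Search_Method": ["Keyword", np.nan, "Filter", np.nan],
    "Gender": ["Female", "Male", "Female", "Others"],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```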

In [139]:
for column in df.columns:
    print(f"Unique values in '{column}':")
    print(df[column].value_counts(dropna=False))  # Includes NaN values in the counts
    print("\n")

Unique values in 'Timestamp':
Timestamp
2023/06/07 11:47:44 AM GMT+5:30    2
2023/06/12 3:48:11 PM GMT+5:30     1
2023/06/04 7:33:12 PM GMT+5:30     1
2023/06/16 9:16:05 AM GMT+5:30     1
2023/06/04 1:28:19 PM GMT+5:30     1
                                  ..
2023/06/12 4:02:02 PM GMT+5:30     1
2023/06/12 4:02:53 PM GMT+5:30     1
2023/06/12 4:03:59 PM GMT+5:30     1
2023/06/12 9:57:20 PM GMT+5:30     1
2023/06/12 3:49:27 PM GMT+5:30     1
Name: count, Length: 601, dtype: int64


Unique values in 'age':
age
23    123
34     48
24     40
25     36
45     34
21     30
26     27
32     19
22     17
27     17
36     16
35     15
37     14
46     12
40     12
28      9
31      9
29      9
30      8
56      8
33      7
18      7
54      6
47      6
43      6
16      5
50      5
44      5
20      5
19      4
17      4
38      4
42      4
39      4
41      4
48      3
60      3
53      2
67      2
57      2
15      2
64      1
63      1
58      1
55      1
62      1
52      1
49      1
3   
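
The `Timestamp` counts above show one value occurring twice, which may or may not indicate a repeated submission. One way to inspect such cases is `DataFrame.duplicated`, sketched here on a hypothetical frame; the `age` values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the repeated timestamp string mirrors the output above
toy = pd.DataFrame({
    "Timestamp": [
        "2023/06/07 11:47:44 AM GMT+5:30",
        "2023/06/07 11:47:44 AM GMT+5:30",
        "2023/06/12 3:48:11 PM GMT+5:30",
    ],
    "age": [23, 23, 34],
})

# Rows whose timestamp occurs more than once (keep=False marks every copy)
same_time = toy[toy.duplicated(subset="Timestamp", keep=False)]

# Rows that repeat an earlier row across all columns
exact_copies = toy[toy.duplicated(keep="first")]

print(same_time)
print(exact_copies)
```

Whether to drop such rows is a judgment call: two respondents can legitimately submit within the same second.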

In [140]:
# Rename the first occurrence of 'Personalized_Recommendation_Frequency'
df.rename(columns={'Personalized_Recommendation_Frequency': 'Personalized_Recommendation_Success'}, inplace=True)

# Parse the 'Timestamp' strings into datetime values
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
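
If the intent is to drop the `GMT+5:30` suffix explicitly before parsing, one possible approach is to strip it with a regular expression and then pass an explicit format. This is a sketch, assuming every value follows the layout seen in the sample rows above; the resulting timestamps are timezone-naive local times.

```python
import pandas as pd

# Two timestamp strings in the layout shown by df.head()
raw = pd.Series([
    "2023/06/04 1:28:19 PM GMT+5:30",
    "2023/06/04 2:30:44 PM GMT+5:30",
])

# Remove a trailing "GMT+H:MM" or "GMT-H:MM" suffix
stripped = raw.str.replace(r"\s*GMT[+-]\d{1,2}:\d{2}$", "", regex=True)

# Parse with an explicit 12-hour format so no format inference is needed
parsed = pd.to_datetime(stripped, format="%Y/%m/%d %I:%M:%S %p")

print(parsed)
```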




In [141]:
# Cleaning column 'Service_Appreciation'

# Strip surrounding whitespace so labels that differ only by spacing match
df['Service_Appreciation'] = df['Service_Appreciation'].str.strip()

# After stripping, the two "Customer service" entries share one label,
# so no separate replacement is needed

# Replace the placeholder "." with NaN
df['Service_Appreciation'] = df['Service_Appreciation'].replace('.', np.nan)

# Check the result
print(df['Service_Appreciation'].value_counts(dropna=False))


Service_Appreciation
Product recommendations                185
Competitive prices                     182
Wide product selection                 150
User-friendly website/app interface     80
Customer service                         2
NaN                                      1
Quick delivery                           1
All the above                            1
Name: count, dtype: int64
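
Labels that differ only by whitespace or letter case can be hard to see in a `value_counts()` listing. The sketch below groups raw labels by a normalized key to surface such variants before cleaning. The `labels` Series is hypothetical, and case-folding is an extra assumption; the notebook itself only strips whitespace.

```python
import pandas as pd

# Hypothetical raw labels, including whitespace and case variants
labels = pd.Series([
    "Customer service",
    "Customer service ",
    "Competitive prices",
    "competitive prices",
    "Quick delivery",
])

# Normalized key: trimmed and case-folded
key = labels.str.strip().str.casefold()

# For each key, collect the distinct raw spellings and keep keys with more than one
variants = labels.groupby(key).unique()
collisions = variants[variants.map(len) > 1]

print(collisions)
```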


In [142]:
import numpy as np

# Replace "." with NaN
df['Improvement_Areas'] = df['Improvement_Areas'].replace('.', np.nan)

# Standardize "UI" and any entry containing "interface" (case-sensitive match) to "User Interface"
df['Improvement_Areas'] = df['Improvement_Areas'].replace(['UI'], 'User Interface')
df['Improvement_Areas'] = df['Improvement_Areas'].replace(
    to_replace=r'.*interface.*', value='User Interface', regex=True
)

# Treat "no problem" style answers as missing values
df['Improvement_Areas'] = df['Improvement_Areas'].replace(
    [
        'No problems with Amazon',
        "I don't have any problem with Amazon",
        'I have no problem with Amazon yet. But others tell me about the refund issues',
        'I have no problem with Amazon yet. But others tell me about the refund issues ',
        'Nil',
        'Nothing',
    ],
    np.nan
)

# Check the result
print(df['Improvement_Areas'].value_counts(dropna=False))


Improvement_Areas
Customer service responsiveness                                  217
Product quality and accuracy                                     159
Reducing packaging waste                                         133
Shipping speed and reliability                                    79
NaN                                                                6
User Interface                                                     4
Add more familiar brands to the list                               1
Scrolling option would be much better than going to next page      1
Quality of product is very poor according to the big offers        1
Irrelevant product suggestions                                     1
Name: count, dtype: int64


In [143]:
# Create a new column 'user_id' with sequential values from 1 to the length of the dataframe
df['user_id'] = range(1, len(df) + 1)

## Users Table

In [144]:
# Create users_df with only 'user_id', 'age', and 'gender' columns
users_df = df[['user_id', 'age', 'Gender']]

# Rename 'Gender' column to 'gender'
users_df = users_df.rename(columns={
    'Gender': 'gender'
})


In [145]:
users_df.to_csv('data/users.csv',index=False)

## User_Behaviour Table

In [146]:
# Create user_behaviour_df with 'user_id' and the remaining columns (excluding 'age' and 'gender')
user_behaviour_df = df.drop(columns=['age', 'Gender'])

In [147]:
# Rename the columns to match the table schema
user_behaviour_df = user_behaviour_df.rename(columns={
    'Timestamp': 'timestamp',
    'Purchase_Frequency': 'purchase_Frequency',
    'Purchase_Categories': 'purchase_Categories',
    'Personalized_Recommendation_Success': 'personalized_Recommendation_Success',
    'Browsing_Frequency': 'browsing_Frequency',
    'Product_Search_Method': 'product_Search_Method',
    'Search_Result_Exploration': 'search_Result_Exploration',
    'Customer_Reviews_Importance': 'customer_Reviews_Importance',
    'Add_to_Cart_Browsing': 'add_to_Cart_Browsing',
    'Cart_Completion_Frequency': 'cart_Completion_Frequency',
    'Cart_Abandonment_Factors': 'cart_Abandonment_Factors',
    'Saveforlater_Frequency': 'saveforlater_Frequency',
    'Review_Left': 'review_Left',
    'Review_Reliability': 'review_Reliability',
    'Review_Helpfulness': 'review_Helpfulness',
    'Personalized_Recommendation_Frequency ': 'personalized_Recommendation_Frequency',
    'Recommendation_Helpfulness': 'recommendation_Helpfulness',
    'Rating_Accuracy ': 'rating_Accuracy',
    'Shopping_Satisfaction': 'shopping_Satisfaction',
    'Service_Appreciation': 'service_Appreciation',
    'Improvement_Areas': 'improvement_Areas',
    'user_id': 'user_id'
})

In [148]:
user_behaviour_df.to_csv('data/user_behaviour.csv',index=False)
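
Because the survey is split into two tables that share `user_id`, a quick sanity check is to confirm that the key is unique and that the tables join back one-to-one. The sketch below rebuilds the split on a small hypothetical frame; the column values are made up, and `validate="one_to_one"` is an optional safeguard rather than part of the original workflow.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned survey frame
df = pd.DataFrame({
    "age": [23, 24, 22],
    "Gender": ["Female", "Prefer not to say", "Female"],
    "Purchase_Frequency": ["Once a month", "Few times a month", "Once a week"],
})
df["user_id"] = range(1, len(df) + 1)

# Same split as in the notebook: demographics vs. behaviour
users = df[["user_id", "age", "Gender"]].rename(columns={"Gender": "gender"})
behaviour = df.drop(columns=["age", "Gender"])

# The key must be unique for the two tables to line up
assert users["user_id"].is_unique

# validate raises MergeError if either side has duplicate keys
rejoined = users.merge(behaviour, on="user_id", how="inner", validate="one_to_one")
print(rejoined)
```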