# Customer Segmentation and Recommendation System Project

## Introduction

In this project, we aim to enhance marketing strategies and increase sales for an online retail business by analyzing transactional data from a UK-based retailer. The dataset, which spans from 2010 to 2011, contains detailed information about customer purchases, including transaction dates, product descriptions, quantities, and prices.

### Project Objectives

1. **Customer Segmentation**: To group customers into distinct segments based on their purchasing behavior using the K-means clustering algorithm. This segmentation will allow the business to tailor marketing strategies to different customer groups effectively.

2. **Recommendation System**: To develop a recommendation system that suggests top-selling products to customers within each segment who have not yet purchased those items. This will help boost sales and enhance customer satisfaction by providing personalized product recommendations.
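The recommendation rule in objective 2 can be sketched on toy data before any modeling is done. This is only an illustration under stated assumptions: the `Segment` column stands in for cluster labels that K-means would produce later, and the `recommend` function, its `top_n` parameter, and the toy customer IDs are hypothetical names that do not come from the dataset.

```python
import pandas as pd

# Toy purchases; Segment is an assumed stand-in for K-means cluster labels.
purchases = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3, 3, 4, 5, 5],
    "Segment":    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    "StockCode":  ["A", "B", "A", "C", "A", "B", "D", "E", "E", "F"],
    "Quantity":   [5, 2, 3, 1, 4, 6, 2, 2, 1, 5],
})

def recommend(purchases: pd.DataFrame, top_n: int = 3) -> dict:
    """Map each customer to their segment's top sellers they have not bought."""
    recs = {}
    for _, group in purchases.groupby("Segment"):
        # Rank products within the segment by total units sold.
        top = (group.groupby("StockCode")["Quantity"].sum()
                    .sort_values(ascending=False)
                    .head(top_n).index.tolist())
        for customer, bought in group.groupby("CustomerID")["StockCode"]:
            owned = set(bought)
            # Keep only top sellers this customer has not purchased yet.
            recs[int(customer)] = [code for code in top if code not in owned]
    return recs

recommendations = recommend(purchases)
print(recommendations)
```

Ranking by units sold is one possible definition of "top-selling"; ranking by revenue or by number of distinct buyers would be equally reasonable choices.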

### Step 1: Data Understanding and Exploration

In the first step of this project, we will explore the dataset to understand its structure, identify any data quality issues, and gain insights into the key features that will be used for customer segmentation and recommendation. The main tasks in this step include:

- Loading the dataset into a pandas DataFrame.
- Performing an initial inspection to understand the structure and content of the data.
- Checking for missing values and data types to ensure data quality.
- Generating basic statistics to understand the distribution and characteristics of numerical and categorical features.

By thoroughly understanding the dataset, we will lay the foundation for effective data preprocessing, feature engineering, and model development in the subsequent steps of the project.

In [34]:
# Import necessary libraries
import pandas as pd

In [35]:
# Load the dataset
file_path = 'data/retail-data.csv'
df = pd.read_csv(file_path)

In [36]:
# Display the first few rows of the dataset to understand its structure
print("First 5 rows of the dataset:")
display(df.head())

First 5 rows of the dataset:


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


In [37]:
# Get the shape of the dataset to understand the number of rows and columns
print("\nDataset shape (rows, columns):", df.shape)


Dataset shape (rows, columns): (541909, 8)


In [38]:
# Check for missing values in each column
print("\nMissing values in each column:")
print(df.isnull().sum())


Missing values in each column:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
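Raw counts are easier to judge as shares of the dataset: 135,080 of 541,909 rows (about 24.9%) lack a `CustomerID`, while only about 0.27% lack a `Description`. A minimal sketch of such a report follows; the `missing_report` helper and the toy frame are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for every column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy frame (values made up for illustration).
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})
report = missing_report(toy)
print(report)
```

On the real data the same helper could be called as `missing_report(df)`.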


In [39]:
# Check the data types of each column
print("\nData types of each column:")
print(df.dtypes)


Data types of each column:
InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
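Two of these types deserve attention: `InvoiceDate` is stored as text, and `CustomerID` is `float64` because a column containing missing values cannot be held as a plain integer. The sketch below shows one way to convert both on a toy frame. The explicit date format is an assumption inferred from the sample values (month/day/year, as in `10/31/2011 14:41`), and using the nullable `Int64` dtype is an optional choice, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the raw dtypes shown above.
toy = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "10/31/2011 14:41"],
    "CustomerID": [17850.0, np.nan],
})

# Assumed format: month/day/year hour:minute, inferred from the sample rows.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

# Nullable integer dtype keeps missing IDs as <NA> instead of forcing floats.
toy["CustomerID"] = toy["CustomerID"].astype("Int64")

print(toy.dtypes)
```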


In [40]:
# Get basic statistics of numeric columns
print("\nSummary statistics of numeric columns:")
display(df.describe())


Summary statistics of numeric columns:


| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
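The negative minimums in this summary (a `Quantity` of -80995 and a `UnitPrice` of -11062.06) show that the data contains rows that are not ordinary sales. Before filtering, it can help to count how many rows are affected. The following is a small sketch on made-up values; the variable names are illustrative only.

```python
import pandas as pd

# Toy rows: one negative quantity, one zero quantity, one zero price.
toy = pd.DataFrame({
    "Quantity": [6, -1, 8, 0],
    "UnitPrice": [2.55, 27.50, 0.0, 3.39],
})

bad_quantity = toy["Quantity"] <= 0
bad_price = toy["UnitPrice"] <= 0

n_bad_quantity = int(bad_quantity.sum())
n_bad_price = int(bad_price.sum())
# Rows that fail at least one of the two checks.
n_bad_either = int((bad_quantity | bad_price).sum())

print(n_bad_quantity, n_bad_price, n_bad_either)
```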


In [41]:
# Get basic statistics of categorical columns
print("\nSummary statistics of categorical columns:")
display(df.describe(include=['object']))


Summary statistics of categorical columns:


| | InvoiceNo | StockCode | Description | InvoiceDate | Country |
|---|---|---|---|---|---|
| count | 541909 | 541909 | 540455 | 541909 | 541909 |
| unique | 25900 | 4070 | 4223 | 23260 | 38 |
| top | 573585 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 10/31/2011 14:41 | United Kingdom |
| freq | 1114 | 2313 | 2369 | 1114 | 495478 |
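The `freq` row shows that 495,478 of 541,909 rows (about 91.4%) come from the United Kingdom. A quick way to see the full distribution of a categorical column is `value_counts(normalize=True)`, sketched here on toy values that are not taken from the dataset.

```python
import pandas as pd

# Toy country column (made-up proportions for illustration).
countries = pd.Series(
    ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
    name="Country",
)

# Share of rows per country, largest first.
shares = countries.value_counts(normalize=True)
print(shares)
```

On the real data this would be `df['Country'].value_counts(normalize=True)`.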


### Step 2: Data Cleaning and Preprocessing

In this step, we will clean and preprocess the dataset based on the insights obtained during the data exploration phase. The main tasks include:

- **Handling missing values**: We will drop rows where `CustomerID` is missing, since customer-level segmentation cannot use transactions that have no customer identifier.
- **Removing duplicates**: We'll remove exact duplicate rows so that repeated records do not inflate quantities or revenue.
- **Filtering out anomalies**: We'll remove rows where `Quantity` or `UnitPrice` is zero or negative, as these values represent returns, errors, or otherwise irrelevant data.
- **Converting data types**: We'll convert `InvoiceDate` to a datetime format, which is necessary for extracting time-based features later.
- **Removing unnecessary columns**: We'll remove the `Description` column, since products remain identifiable by `StockCode` and the free-text description is not used for analysis or modeling.
- **Creating new features**: We'll create a `TotalPrice` feature, which will help in understanding customer monetary value for segmentation.

This preprocessing will prepare the dataset for customer segmentation and recommendation system modeling.

In [42]:
# 1. Remove the 'Description' column as it's not needed
df_cleaned = df.drop(columns=['Description'])

# 2. Handling missing values
# Remove rows where the CustomerID is missing (essential for customer-level analysis)
df_cleaned = df_cleaned.dropna(subset=['CustomerID'])

# 3. Removing duplicates
# Remove any duplicate rows to ensure unique transactions
df_cleaned = df_cleaned.drop_duplicates()

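# 4. Filtering out anomalies
# Keep only rows where both Quantity and UnitPrice are positive
# (zero or negative values indicate returns or errors).
# .copy() avoids a SettingWithCopyWarning when columns are assigned below.
df_cleaned = df_cleaned[(df_cleaned['Quantity'] > 0) & (df_cleaned['UnitPrice'] > 0)].copy()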

# 5. Converting data types
# Convert InvoiceDate to datetime format for time-based analysis
df_cleaned['InvoiceDate'] = pd.to_datetime(df_cleaned['InvoiceDate'])

# 6. Creating new features
# Create a 'TotalPrice' feature = Quantity * UnitPrice
df_cleaned['TotalPrice'] = df_cleaned['Quantity'] * df_cleaned['UnitPrice']

In [43]:
# Report the cleaned dataset's shape and confirm that no missing values remain
print("Cleaned dataset shape (rows, columns):", df_cleaned.shape)
print("\nMissing values after cleaning:")
print(df_cleaned.isnull().sum())

Cleaned dataset shape (rows, columns): (392690, 8)

Missing values after cleaning:
InvoiceNo      0
StockCode      0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
TotalPrice     0
dtype: int64
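Cleaning reduced the dataset from 541,909 to 392,690 rows, but the single final shape does not show how much each step contributed. One way to track this is to log the row count around every step, as in the sketch below. The `clean_with_log` helper and the toy frame are hypothetical; the three steps mirror the row-removing operations in the cleaning cell above.

```python
import numpy as np
import pandas as pd

def clean_with_log(frame: pd.DataFrame):
    """Apply the row-removing cleaning steps and record rows dropped by each."""
    steps = [
        ("drop missing CustomerID", lambda d: d.dropna(subset=["CustomerID"])),
        ("drop duplicate rows", lambda d: d.drop_duplicates()),
        ("keep positive Quantity and UnitPrice",
         lambda d: d[(d["Quantity"] > 0) & (d["UnitPrice"] > 0)]),
    ]
    log = []
    for name, step in steps:
        before = len(frame)
        frame = step(frame)
        log.append((name, before - len(frame)))
    return frame.copy(), log

# Toy frame: one duplicate, one missing ID, one negative quantity, one zero price.
toy = pd.DataFrame({
    "StockCode": ["A", "A", "B", "C", "D", "E"],
    "CustomerID": [1.0, 1.0, np.nan, 2.0, 2.0, 3.0],
    "Quantity": [6, 6, 1, -1, 2, 4],
    "UnitPrice": [2.55, 2.55, 1.00, 3.00, 0.0, 1.25],
})
cleaned, log = clean_with_log(toy)
for name, removed in log:
    print(f"{name}: removed {removed} rows")
```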


In [44]:
# Display the first few rows of the cleaned dataset
print("\nFirst 5 rows of the cleaned dataset:")
display(df_cleaned.head())


First 5 rows of the cleaned dataset:


| | InvoiceNo | StockCode | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalPrice |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.3 |
| 1 | 536365 | 71053 | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.0 |
| 3 | 536365 | 84029G | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
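To illustrate how `TotalPrice` feeds into customer monetary value, line items can be summed per customer. The sketch below uses a toy frame; the first two rows reuse values shown above, while the third row and the `Monetary` and `Orders` column names are assumptions made for illustration rather than the features the project necessarily uses.

```python
import pandas as pd

# Toy line items; the third row is made up for illustration.
toy = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 13047.0],
    "InvoiceNo": ["536365", "536365", "536367"],
    "TotalPrice": [15.30, 20.34, 22.00],
})

# One row per customer: total spend and number of distinct invoices.
per_customer = toy.groupby("CustomerID").agg(
    Monetary=("TotalPrice", "sum"),
    Orders=("InvoiceNo", "nunique"),
)
print(per_customer)
```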
