# Bigbasket Customer Analytics
#### Using Customer Analytics to Enhance the Shopping Experience

## About Dataset
Customer analytics plays a crucial role in the success of BigBasket, the popular e-commerce platform specializing in the online grocery segment. By leveraging customer data and employing analytics techniques, BigBasket gains valuable insights into customer behavior, preferences, and patterns, enabling them to enhance the shopping experience and drive customer satisfaction.

One key aspect of customer analytics for BigBasket is understanding customer preferences and purchase patterns. By analyzing the following data, BigBasket can identify popular products, frequently purchased items, and emerging trends:

1. customer transactions,
2. browsing history, and
3. search queries.
    
This information helps them optimize their product offerings, stock inventory accordingly, and tailor personalized recommendations to individual customers. By suggesting relevant products based on customer preferences, BigBasket increases the likelihood of repeat purchases and customer loyalty.

Customer analytics also helps BigBasket optimize their supply chain and logistics operations by analyzing:

1. order patterns,
2. delivery locations, and
3. delivery timings.

With these, BigBasket can optimize their delivery routes, reduce delivery times, and ensure efficient order fulfillment. This leads to improved customer satisfaction and reinforces BigBasket's reputation for reliable and timely deliveries.

Furthermore, customer analytics provides valuable insights for BigBasket's pricing strategies by analyzing:

1. customer purchasing patterns,
2. price sensitivity, and
3. competitor pricing.

With these, BigBasket can optimize their pricing models to remain competitive while maximizing profitability. This ensures that customers perceive BigBasket as offering value for money, attracting more customers and boosting revenue.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### IMPORT MANDATORY LIBRARIES 

In [None]:
import pandas as pd

# import dataprep
# from dataprep import eda
# from dataprep.eda import create_report
from ydata_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt

from itertools import combinations
# from collections import Counter
from collections import defaultdict, Counter

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

import warnings
warnings.filterwarnings('ignore')

#### DATA UNDERSTANDING

In [None]:
member_df = pd.read_csv("/kaggle/input/bigbasket-customer-analytics/Memberdata.csv")
bigbasket_df = pd.read_csv("/kaggle/input/bigbasket-customer-analytics/bigBasket.csv")
desc_df = pd.read_csv("/kaggle/input/bigbasket-customer-analytics/IMB575-XLS-ENG.csv")

In [None]:
member_df.head()

In [None]:
bigbasket_df.tail(25)

In [None]:
desc_df.head()

In [None]:
print(member_df.shape)
print(bigbasket_df.shape)
print(desc_df.shape)

In [None]:
print(member_df.columns)
print(bigbasket_df.columns)

In [None]:
print(member_df.info())
print(bigbasket_df.info())
print(desc_df.info())

#### DATA PREPROCESSING

In [None]:
# Convert Created On: handle mixed formats (date string or Excel serial number)
def convert_mixed_date(val):
    try:
        return pd.to_datetime(val)
    except (ValueError, TypeError):
        # Fall back to treating the value as an Excel serial day count
        return pd.to_datetime(float(val), origin='1899-12-30', unit='D')

In [None]:
bigbasket_df['Created On'] = bigbasket_df['Created On'].apply(convert_mixed_date)
bigbasket_df['Order Date'] = bigbasket_df['Created On'].dt.date
bigbasket_df['Hour'] = bigbasket_df['Created On'].dt.hour

# member_df['Created On'] = member_df['Created On'].apply(convert_mixed_date)
# member_df['Order Date'] = member_df['Created On'].dt.date
# member_df['Hour'] = member_df['Created On'].dt.hour
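
The row-by-row conversion above can also be written as a vectorized operation. The following is a minimal sketch under assumptions: the sample values are made up, and it assumes that any value that parses as a number is an Excel serial day count. It is not a drop-in replacement verified against the actual `Created On` column.

```python
import pandas as pd

# Made-up sample mixing a date string, a numeric string, and a raw number
raw = pd.Series(['2014-01-15 09:30:00', '41640.5', 41641])

# Values that parse as numbers are treated as Excel serial day counts;
# genuine date strings become NaN here
serials = pd.to_numeric(raw, errors='coerce')
from_serial = pd.to_datetime(serials, unit='D', origin='1899-12-30')

# Parse only the remaining (non-numeric) values as date strings
from_text = pd.to_datetime(raw.where(serials.isna()), errors='coerce')

# Combine both results into one datetime Series
parsed = from_serial.fillna(from_text)
print(parsed)
```

Excel serial 41640 corresponds to 2014-01-01 when counted from the 1899-12-30 origin, which is why that origin is used for the fallback.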

#### EDA (using ProfileReport)

In [None]:
# create_report(df).show_browser()
# create_report(bigbasket_df)
ProfileReport(bigbasket_df, title="EDA Report", explorative=True, 
              correlations={
                        "pearson": {"calculate": True},
                        "cramers": {"calculate": True}      # Categorical vars
              }
             )

#### EXHAUSTIVE CUSTOMER DATA ANALYSIS

### A:
By analyzing customer transaction data, BigBasket can identify:

1. popular products,
2. frequently purchased items, and
3. emerging trends.

This information helps them optimize their product offerings, stock inventory accordingly, and tailor personalized recommendations to individual customers. By suggesting relevant products based on customer preferences, BigBasket increases the likelihood of repeat purchases and customer loyalty.

In [None]:
# Total orders per customer
orders_per_customer = bigbasket_df.groupby('Member')['Order'].nunique().reset_index(name='Total Orders')

plt.figure(figsize=(20, 5))
sns.barplot(data=orders_per_customer, x='Member', y='Total Orders', palette='Blues_d')
plt.title('Total Orders per Customer')
plt.ylabel('Number of Orders')
plt.xlabel('Customer (Member ID)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Unique items bought per customer
unique_items = bigbasket_df.groupby('Member')['SKU'].nunique().reset_index(name='Unique SKUs')

plt.figure(figsize=(20, 5))
sns.barplot(data=unique_items, x='Member', y='Unique SKUs', palette='Greens_d')
plt.title('Unique Items Bought per Customer')
plt.ylabel('Number of Unique SKUs')
plt.xlabel('Customer (Member ID)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# Average basket size (Average Items per Order)
basket_size = bigbasket_df.groupby(['Member', 'Order'])['SKU'].count().groupby('Member').mean().reset_index(name='Avg Basket Size')

plt.figure(figsize=(20, 5))
sns.barplot(data=basket_size, x='Member', y='Avg Basket Size', palette='Oranges_d')
plt.title('Average Basket Size per Customer')
plt.ylabel('Items per Order')
plt.xlabel('Customer (Member ID)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Number of Transactions per Customer (Raw Count)
transaction_counts = bigbasket_df.groupby('Member')['Order'].count().reset_index(name='Total Transactions')

plt.figure(figsize=(20, 5))
sns.barplot(data=transaction_counts, x='Member', y='Total Transactions', palette='Purples_d')
plt.title('Total Transactions per Customer')
plt.ylabel('Count of Line Items')
plt.xlabel('Customer (Member ID)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
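
The four per-customer metrics above differ mainly in whether they count orders or line items. As a rough illustration on made-up data (the toy frame and the names `toy` and `summary` are assumptions, not part of the dataset), they can be computed side by side:

```python
import pandas as pd

# Toy line-item data: one row per product in an order
toy = pd.DataFrame({
    'Member': ['M1', 'M1', 'M1', 'M2', 'M2'],
    'Order':  [101, 101, 102, 201, 201],
    'SKU':    ['A', 'B', 'A', 'C', 'D'],
})

summary = toy.groupby('Member').agg(
    total_orders=('Order', 'nunique'),  # distinct orders
    line_items=('Order', 'count'),      # raw row count (line items)
    unique_skus=('SKU', 'nunique'),     # distinct products bought
)

# Mean items per order equals line items divided by distinct orders
summary['avg_basket_size'] = summary['line_items'] / summary['total_orders']
print(summary)
```

Here M1 has two orders but three line items, so the "transactions" chart (line items) will always be at least as tall as the "orders" chart for the same customer.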

### B:
Customer analytics also helps BigBasket optimize their supply chain and logistics operations by analyzing:

1. order patterns,
2. delivery locations, and
3. delivery timings.

With these, BigBasket can optimize their delivery routes, reduce delivery times, and ensure efficient order fulfillment. This leads to improved customer satisfaction and reinforces BigBasket's reputation for reliable and timely deliveries.

In [None]:
# Order Trend Over Time
orders_over_time = bigbasket_df.groupby('Order Date')['Order'].nunique().reset_index(name='Unique Orders')

plt.figure(figsize=(20, 5))
sns.lineplot(data=orders_over_time, x='Order Date', y='Unique Orders', marker='o')
plt.title('Orders Over Time')
plt.ylabel('Number of Orders')
plt.xlabel('Date')
plt.tight_layout()
plt.show()

In [None]:
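# Hourly Purchase Pattern (one row per order, so multi-item orders are not over-counted)
orders_by_hour = bigbasket_df.drop_duplicates(subset='Order')

plt.figure(figsize=(7, 4))
sns.countplot(data=orders_by_hour, x='Hour', palette='Oranges')
plt.title('Orders by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Order Count')
plt.tight_layout()
plt.show()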

In [None]:
# Get top 10 orders by item count
top_order_size = (
    bigbasket_df.groupby('Order')['SKU']
    .count()
    .reset_index(name='Items in Order')
    .sort_values('Items in Order', ascending=False)
    .head(10)
)

plt.figure(figsize=(8, 4))
sns.barplot(data=top_order_size, x='Order', y='Items in Order', palette='Blues_d')
plt.title('Top 10 Orders by Basket Size')
plt.ylabel('Number of Items')
plt.xlabel('Order ID')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### Insights:
    
1. Basket size varies, with some orders having 6–7 items.
2. The customer frequently purchases Glucose Biscuits, Banana, and Bread across orders, which is helpful for inventory planning.
3. Orders occurred around 9–10 AM, hinting at a potential morning delivery window preference.
4. Products like Organic Flours, Beans, and Bread repeat across orders, making them good candidates for auto-recommendation or subscription models.

### C: 

Customer analytics provides valuable insights for BigBasket's pricing strategies by analyzing:

1. customer purchasing patterns,
2. price sensitivity, and
3. competitor pricing.

With these, BigBasket can optimize their pricing models to remain competitive while maximizing profitability. This ensures that customers perceive BigBasket as offering value for money, attracting more customers and boosting revenue.

In [None]:
# Repeat Purchases of the Same Product (Customer Loyalty Signal)
repeat_df = bigbasket_df.groupby(['Member', 'Description']).size().reset_index(name='Count')
repeat_df = repeat_df[repeat_df['Count'] > 1]

plt.figure(figsize=(10, 6))
sns.countplot(data=repeat_df, y='Description', order=repeat_df['Description'].value_counts().head(10).index, palette='Oranges_r')
plt.title('Products Frequently Reordered by Customers')
plt.xlabel('Number of Customers (Reordered)')
plt.ylabel('Product')
plt.tight_layout()
plt.show()

In [None]:
# Frequently Bought Together (Bundling Potential)

# Get product pairs from each order
order_groups = bigbasket_df.groupby('Order')['Description'].apply(list)

pairs = []
for items in order_groups:
    pairs.extend(combinations(sorted(set(items)), 2))

pair_counts = Counter(pairs)
top_pairs = pd.DataFrame(pair_counts.most_common(10), columns=['Pair', 'Frequency'])

# Convert tuple to string for plotting
top_pairs['Pair'] = top_pairs['Pair'].apply(lambda x: f'{x[0]} & {x[1]}')

plt.figure(figsize=(12, 6))
sns.barplot(data=top_pairs, x='Frequency', y='Pair', palette='Purples_d')
plt.title('Top 10 Frequently Bought Together Product Pairs')
plt.xlabel('Purchase Frequency')
plt.ylabel('Product Pair')
plt.tight_layout()
plt.show()
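
To make the pair-counting step easier to follow, here is the same idea on a tiny made-up set of orders (the `orders` dictionary is illustrative only). Each order is de-duplicated and sorted first, so a pair is counted once per order and always in the same item order:

```python
from collections import Counter
from itertools import combinations

# Toy orders: order ID -> list of product descriptions (duplicates allowed)
orders = {
    101: ['Bread', 'Banana', 'Bread'],
    102: ['Bread', 'Banana', 'Milk'],
    103: ['Milk'],
}

pair_counts = Counter()
for items in orders.values():
    # set() removes repeats within an order; sorted() gives a canonical pair order
    pair_counts.update(combinations(sorted(set(items)), 2))

for pair, freq in pair_counts.most_common():
    print(pair, freq)
```

Single-item orders such as 103 contribute no pairs, and `('Banana', 'Bread')` is counted twice because it appears in two different orders.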


In [None]:
# Top Frequently Purchased Products (Pricing Leverage)
top_products = bigbasket_df['Description'].value_counts().reset_index()
top_products.columns = ['Product', 'Purchase Count']

plt.figure(figsize=(10, 6))
sns.barplot(data=top_products.head(10), y='Product', x='Purchase Count', palette='Greens_d')
plt.title('Top 10 Frequently Purchased Products')
plt.xlabel('Number of Purchases')
plt.ylabel('Product')
plt.tight_layout()
plt.show()

The breakdown below summarizes what each visualization shows and how it can inform BigBasket's pricing and operational strategy.

---

## ✅ **Visual Strategy Breakdown — with Strategic Insights**

---

### 🔹 1. **Top Frequently Purchased Products**

**Visualization**: Bar chart of top 10 most purchased products (by description)

**🔍 What it shows**:

* Which categories or SKUs are most popular overall
* High-frequency purchases, which suggest high demand and may indicate lower price sensitivity

**💡 Strategic Insight**:

* These products may **sustain a premium price** due to habitual/repeat use.
* **Dynamic pricing** strategies can be tested (e.g., gradual price increase on top items).
* Popular products can be **excluded from discounts** to maximize profit.

---

### 🔹 2. **Repeat Purchases by Customer**

**Visualization**: Countplot of most re-ordered products by customers

**🔍 What it shows**:

* Products that customers frequently buy multiple times
* Helps identify **loyalty-prone products** or essentials

**💡 Strategic Insight**:

* These are ideal for **subscription pricing**, loyalty rewards, or volume discounts.
* Consider **auto-delivery** offers or product refill reminders.
* Price sensitivity for such items might be **lower**—opportunity for **margin optimization**.

---

### 🔹 3. **Frequently Bought Together (Pairing Analysis)**

**Visualization**: Bar chart of most common product pairs purchased in the same order

**🔍 What it shows**:

* Commonly paired products → great for bundling strategies
* Implies **customer mental models** (e.g., “biscuits + tea”, “banana + milk”)

**💡 Strategic Insight**:

* Create **combo packs or bundle discounts** to increase average order value.
* Can drive **cross-category upsell** (e.g., pair a cheap item with a high-margin product).
* Helps optimize **product placement** on app/web for co-viewing.
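
Raw pair counts tend to favor products that are popular on their own. One common way to check whether a pair co-occurs more often than chance would predict is lift, i.e. the pair's support divided by the product of the individual supports. The sketch below uses made-up baskets, and the `lift` helper is a hypothetical name introduced for illustration; it is not part of the analysis above.

```python
from collections import Counter
from itertools import combinations

# Toy baskets: one set of products per order
orders = [
    {'Bread', 'Banana'},
    {'Bread', 'Banana', 'Milk'},
    {'Milk'},
]
n_orders = len(orders)

item_counts = Counter(item for basket in orders for item in basket)
pair_counts = Counter(
    pair for basket in orders for pair in combinations(sorted(basket), 2)
)

def lift(a, b):
    """Support of the pair divided by the product of the item supports."""
    pair = tuple(sorted((a, b)))
    support_pair = pair_counts[pair] / n_orders
    support_a = item_counts[a] / n_orders
    support_b = item_counts[b] / n_orders
    return support_pair / (support_a * support_b)

print(lift('Bread', 'Banana'))  # above 1: bought together more than chance
print(lift('Banana', 'Milk'))   # below 1: bought together less than chance
```

A lift above 1 suggests the pair is a stronger bundling candidate than its raw frequency alone would indicate.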

---

### 🔹 4. **Delivery Time by Cluster (from route clustering)**

**Visualization**: Box plot of delivery time distributions across route clusters (proposed; this notebook does not produce it)

**🔍 What it shows**:

* Variation in delivery time based on location clusters
* Clusters can represent **zones of demand density or delivery efficiency**

**💡 Strategic Insight**:

* Use for **route optimization**: clusters with higher delivery times may need re-routing or closer hubs.
* Inform **location-based delivery fees or SLAs**.
* Combine with frequency analysis to **prioritize logistics investment** in heavy-use areas.

---

### 🔹 5. **Order Hour Distribution**

**Visualization**: Bar chart of orders per hour of day

**🔍 What it shows**:

* Customer **purchase time preferences** during the day

**💡 Strategic Insight**:

* Time-based promotions can be aligned with **peak activity hours** (e.g., 9 AM offers).
* Useful for **staffing delivery teams** effectively.
* Can **personalize app experience** (e.g., change homepage content based on hour of login).
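
When picking a peak hour for promotions or staffing, it matters whether orders or line items are counted. The following sketch on made-up timestamps (the `toy` frame is an assumption, not real data) shows how one large order can shift the apparent peak:

```python
import pandas as pd

# Toy line items: order 3 is a single large evening order
toy = pd.DataFrame({
    'Order': [1, 1, 2, 4, 3, 3, 3, 3, 3],
    'Created On': pd.to_datetime([
        '2014-01-01 09:05', '2014-01-01 09:05', '2014-01-01 09:40',
        '2014-01-02 09:55', '2014-01-01 18:10', '2014-01-01 18:10',
        '2014-01-01 18:10', '2014-01-01 18:10', '2014-01-01 18:10',
    ]),
})
toy['Hour'] = toy['Created On'].dt.hour

orders_per_hour = toy.groupby('Hour')['Order'].nunique()  # distinct orders
items_per_hour = toy.groupby('Hour').size()               # line items

print(orders_per_hour)
print(items_per_hour)
print('Peak hour by orders:', orders_per_hour.idxmax())
print('Peak hour by items:', items_per_hour.idxmax())
```

Counting distinct orders is usually the better basis for delivery staffing, while line-item counts are closer to picking and packing workload.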

---

## 🧠 Executive Summary

| Visual                      | Insight                 | Strategic Use                                    |
| --------------------------- | ----------------------- | ------------------------------------------------ |
| 📦 Top Products             | Most popular categories | Identify price elasticity and margin potential   |
| 🔁 Repeat Items             | Loyal buys              | Discount modeling, subscription offers           |
| 🔗 Product Pairs            | Bundling opportunities  | Upselling, cross-promo packaging                 |
| 📍 Clustered Delivery Times | Geo-efficiency          | Route optimization, zonal pricing                |
| ⏰ Order Timing              | Purchase patterns       | Timing of promos, staff delivery alignment       |

---


#### MODEL BUILDING

##### **Part 1 :** Predict Next Likely Purchase Category

**Goal:**
Predict the next product category a customer is likely to purchase based on past behavior.

In [None]:
# Sort the data by customer and date/time
customer_df = bigbasket_df.copy()
customer_df['Created On'] = pd.to_datetime(customer_df['Created On'], errors='coerce')
customer_df = customer_df.sort_values(['Member', 'Created On'])

# Restrict to a single customer so that other members' purchases are not counted
member_id = 'M64379'
member_orders = customer_df[customer_df['Member'] == member_id]

# Step 1: Find categories purchased in the customer's latest order
latest_order_id = member_orders['Order'].iloc[-1]
latest_order_items = set(member_orders.loc[member_orders['Order'] == latest_order_id, 'Description'])

# Step 2: Find all of the customer's previous purchases (excluding the latest order)
prev_purchases = member_orders.loc[member_orders['Order'] != latest_order_id, 'Description'].tolist()

# Step 3: Count past frequencies, excluding items already purchased in the latest order
counts = Counter(item for item in prev_purchases if item not in latest_order_items)
next_likely = counts.most_common(3)

print("🔮 Next likely categories to be purchased:")
for category, freq in next_likely:
    print(f"- {category} (seen {freq} times)")

#### Markov Chain Model – Lightweight, Interpretable
Predict the next likely product based on the last product(s) using transition probabilities.

In [None]:
# Assume df has: ['Member', 'Order', 'Description', 'Created On']
df = bigbasket_df.copy()
df['Created On'] = pd.to_datetime(df['Created On'])
df = df.sort_values(by=['Member', 'Created On'])

In [None]:
# Build transition matrix: Description → next Description
transitions = defaultdict(Counter)

# Group by user, treat each user's sequence as a chain
for _, group in df.groupby('Member'):
    sequence = group['Description'].tolist()
    for i in range(len(sequence) - 1):
        transitions[sequence[i]][sequence[i+1]] += 1

# Normalize to get probabilities
markov_model = {
    state: {next_state: count / sum(counts.values()) for next_state, count in counts.items()}
    for state, counts in transitions.items()
}

In [None]:
# Predict next product for a given last product
def predict_next_product(last_product, model=markov_model, top_k=3):
    if last_product not in model:
        return "No prediction (unseen product)"
    return sorted(model[last_product].items(), key=lambda x: x[1], reverse=True)[:top_k]

In [None]:
# Example usage
predict_next_product("Bread")
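
To see how the transition probabilities come out, here is the same first-order Markov chain built on two made-up purchase sequences (the `sequences` dictionary and the `top_next` helper are illustrative names, not part of the notebook):

```python
from collections import Counter, defaultdict

# Toy purchase sequences per member, already in chronological order
sequences = {
    'M1': ['Bread', 'Milk', 'Bread', 'Eggs'],
    'M2': ['Bread', 'Milk'],
}

# Count transitions: current product -> next product
transitions = defaultdict(Counter)
for seq in sequences.values():
    for current, nxt in zip(seq, seq[1:]):
        transitions[current][nxt] += 1

# Normalize each row of counts into probabilities
model = {
    state: {nxt: c / sum(cnt.values()) for nxt, c in cnt.items()}
    for state, cnt in transitions.items()
}

def top_next(product, k=2):
    """Return up to k most probable next products, or [] if unseen."""
    ranked = sorted(model.get(product, {}).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

print(top_next('Bread'))
print(top_next('Rice'))
```

'Bread' is followed by 'Milk' twice and by 'Eggs' once, so the estimated probabilities are 2/3 and 1/3. Note that items within one order share a timestamp, so their relative order in a real sequence may be arbitrary.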

---


**--------------- by Sakshi Maharana -----------------------**

PS. Reviews, comments, discussion, and feedback are welcome. This notebook focuses on exhaustive EDA and model building. Hope you liked it!!

**Upvotes** for my Kaggle kernel are appreciated.

**THANK YOU!!**
