I have worked on analyzing a dataset of over 20,000 sales records from an Amazon-like e-commerce platform. This project involves extensive querying of customer behavior, product performance, and sales trends using PostgreSQL. Through this project, I have tackled various SQL problems, including revenue analysis, customer segmentation, and inventory management.
The project also focuses on data cleaning, handling null values, and solving real-world business problems using structured queries.
An ERD diagram is included to visually represent the database schema and relationships between tables.
CREATE TABLE category
(
category_id INT PRIMARY KEY,
category_name VARCHAR(20)
);
-- customers TABLE
CREATE TABLE customers
(
customer_id INT PRIMARY KEY,
first_name VARCHAR(20),
last_name VARCHAR(20),
state VARCHAR(20),
address VARCHAR(5) DEFAULT ('xxxx')
);
-- sellers TABLE
CREATE TABLE sellers
(
seller_id INT PRIMARY KEY,
seller_name VARCHAR(25),
origin VARCHAR(15)
);
-- products table
CREATE TABLE products
(
product_id INT PRIMARY KEY,
product_name VARCHAR(50),
price FLOAT,
cogs FLOAT,
category_id INT, -- FK
CONSTRAINT product_fk_category FOREIGN KEY(category_id) REFERENCES category(category_id)
);
-- orders
CREATE TABLE orders
(
order_id INT PRIMARY KEY,
order_date DATE,
customer_id INT, -- FK
seller_id INT, -- FK
order_status VARCHAR(15),
CONSTRAINT orders_fk_customers FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
CONSTRAINT orders_fk_sellers FOREIGN KEY (seller_id) REFERENCES sellers(seller_id)
);
CREATE TABLE order_items
(
order_item_id INT PRIMARY KEY,
order_id INT, -- FK
product_id INT, -- FK
quantity INT,
price_per_unit FLOAT,
CONSTRAINT order_items_fk_orders FOREIGN KEY (order_id) REFERENCES orders(order_id),
CONSTRAINT order_items_fk_products FOREIGN KEY (product_id) REFERENCES products(product_id)
);
-- payment TABLE
CREATE TABLE payments
(
payment_id
INT PRIMARY KEY,
order_id INT, -- FK
payment_date DATE,
payment_status VARCHAR(20),
CONSTRAINT payments_fk_orders FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE TABLE shippings
(
shipping_id INT PRIMARY KEY,
order_id INT, -- FK
shipping_date DATE,
return_date DATE,
shipping_providers VARCHAR(15),
delivery_status VARCHAR(15),
CONSTRAINT shippings_fk_orders FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE TABLE inventory
(
inventory_id INT PRIMARY KEY,
product_id INT, -- FK
stock INT,
warehouse_id INT,
last_stock_date DATE,
CONSTRAINT inventory_fk_products FOREIGN KEY (product_id) REFERENCES products(product_id)
);
I cleaned the dataset by:
- Removing duplicates: Duplicates in the customer and order tables were identified and removed.
- Handling missing values: Null values in critical fields (e.g., customer address, payment status) were either filled with default values or handled using appropriate methods.
Null values were handled based on their context:
- Customer addresses: Missing addresses were assigned default placeholder values.
- Payment statuses: Orders with null payment statuses were categorized as “Pending.”
- Shipping information: Null return dates were left as is, as not all shipments are returned.
The primary objective of this project is to showcase SQL proficiency through complex queries that address real-world e-commerce business challenges. The analysis covers various aspects of e-commerce operations, including:
- Customer behavior
- Sales trends
- Inventory management
- Payment and shipping analysis
- Forecasting and product performance
Key business problems identified:
- Low product availability due to inconsistent restocking.
- High return rates for specific product categories.
- Significant delays in shipments and inconsistencies in delivery times.
- High customer acquisition costs with a low customer retention rate.
- Top Selling Products Query the top 10 products by total sales value. Challenge: Include product name, total quantity sold, and total sales value.
SELECT
oi.product_id,
p.product_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sale,
COUNT(o.order_id) AS total_orders
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN products AS p ON p.product_id = oi.product_id
GROUP BY oi.product_id, p.product_name
ORDER BY total_sale DESC
LIMIT 10;
- Revenue by Category Calculate total revenue generated by each product category. Challenge: Include the percentage contribution of each category to total revenue.
SELECT
c.category_id,
c.category_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sale,
ROUND(
(SUM(oi.total_sale)::numeric /
(SELECT SUM(total_sale)::numeric FROM order_items)
) * 100,
2
) AS contribution
FROM order_items AS oi
JOIN products AS p ON p.product_id = oi.product_id
LEFT JOIN category AS c ON c.category_id = p.category_id
GROUP BY c.category_id, c.category_name
ORDER BY total_sale DESC;
- Average Order Value (AOV) Compute the average order value for each customer. Challenge: Include only customers with more than 5 orders.
SELECT
c.customer_id,
CONCAT(c.first_name, ' ', c.last_name) AS full_name,
ROUND(SUM(oi.total_sale)::numeric / COUNT(o.order_id),2) AS AOV
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY c.customer_id, full_name
HAVING COUNT(o.order_id) > 5;
- Monthly Sales Trend Query monthly total sales over the past year. Challenge: Display the sales trend, grouping by month, return current_month sale, last month sale!
SELECT
year,
month,
total_sale as current_month_sale,
LAG(total_sale, 1) OVER(ORDER BY year, month) as last_month_sale
FROM ---
(
SELECT
EXTRACT(MONTH FROM o.order_date) as month,
EXTRACT(YEAR FROM o.order_date) as year,
ROUND(
SUM(oi.total_sale::numeric)
,2) as total_sale
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY 1, 2
ORDER BY year, month
) as t1
- Customers with No Purchases Find customers who have registered but never placed an order. Challenge: List customer details and the time since their registration.
SELECT
c.customer_id,
CONCAT(c.first_name, ' ', c.last_name) AS full_name,
c.state
FROM customers AS c
LEFT JOIN
orders AS o
ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;
- Least-Selling Categories by State Identify the least-selling product category for each state. Challenge: Include the total sales for that category within each state.
WITH ranking_table AS (
SELECT
c.state,
cat.category_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sales,
RANK() OVER (PARTITION BY c.state ORDER BY SUM(oi.total_sale) ASC) AS rank
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
JOIN order_items AS oi ON o.order_id = oi.order_id
JOIN products AS p ON oi.product_id = p.product_id
JOIN category AS cat ON cat.category_id = p.category_id
GROUP BY c.state, cat.category_name
)
SELECT
state,
category_name,
total_sales
FROM ranking_table
WHERE rank = 1;
- Customer Lifetime Value (CLTV) Calculate the total value of orders placed by each customer over their lifetime. Challenge: Rank customers based on their CLTV.
SELECT
c.customer_id,
CONCAT(c.first_name, ' ', c.last_name) AS full_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS CLTV,
DENSE_RANK() OVER (ORDER BY SUM(oi.total_sale) DESC) AS cust_ranking
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY c.customer_id, full_name
ORDER BY cust_ranking;
- Inventory Stock Alerts Query products with stock levels below a certain threshold (e.g., less than 10 units). Challenge: Include last restock date and warehouse information.
SELECT
i.inventory_id,
p.product_name,
i.stock AS current_stock_left,
i.last_stock_date,
i.warehouse_id
FROM inventory AS i
JOIN products AS p ON p.product_id = i.product_id
WHERE stock < 10
- Shipping Delays Identify orders where the shipping date is later than 3 days after the order date. Challenge: Include customer, order details, and delivery provider.
SELECT
c.*,
o.*,
s.shipping_providers
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN shippings AS s ON o.order_id = s.order_id
WHERE s.shipping_date > o.order_date + INTERVAL '3 day';
- Payment Success Rate Calculate the percentage of successful payments across all orders. Challenge: Include breakdowns by payment status (e.g., failed, pending).
SELECT
p.payment_status,
COUNT(*) AS total_cnt,
ROUND(COUNT(*)::numeric / (SELECT COUNT(*) FROM payments)::numeric * 100, 2) AS percentage
FROM orders AS o
JOIN payments AS p ON o.order_id = p.order_id
GROUP BY p.payment_status;
- Top Performing Sellers Find the top 5 sellers based on total sales value. Challenge: Include both successful and failed orders, and display their percentage of successful orders.
WITH top_sellers AS (
SELECT
s.seller_id,
s.seller_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sale
FROM orders AS o
JOIN sellers AS s ON o.seller_id = s.seller_id
JOIN order_items AS oi ON o.order_id = oi.order_id
GROUP BY s.seller_id, s.seller_name
ORDER BY total_sale DESC
LIMIT 5
),
sellers_reports AS (
SELECT
o.seller_id,
tp.seller_name,
o.order_status,
COUNT(*) AS total_orders
FROM orders AS o
JOIN top_sellers AS tp ON o.seller_id = tp.seller_id
WHERE o.order_status NOT IN ('Inprogress', 'Returned')
GROUP BY o.seller_id, tp.seller_name, o.order_status
)
SELECT
seller_id,
seller_name,
SUM(CASE WHEN order_status = 'Completed' THEN total_orders ELSE 0 END) AS completed_orders,
SUM(CASE WHEN order_status = 'Cancelled' THEN total_orders ELSE 0 END) AS cancelled_orders,
SUM(total_orders) AS total_orders,
ROUND(
COALESCE(SUM(CASE WHEN order_status = 'Completed' THEN total_orders ELSE 0 END)::numeric
/ NULLIF(SUM(total_orders), 0)::numeric * 100, 0),
2
) AS perc_successful_orders
FROM sellers_reports
GROUP BY seller_id, seller_name
ORDER BY perc_successful_orders DESC;
- Product Profit Margin Calculate the profit margin for each product (difference between price and cost of goods sold). Challenge: Rank products by their profit margin, showing highest to lowest. */
SELECT
p.product_id,
p.product_name,
ROUND(
(
SUM(oi.total_sale - (p.cogs * oi.quantity))::numeric
/ NULLIF(SUM(oi.total_sale), 0)::numeric
) * 100,
2
) AS profit_margin
FROM order_items AS oi
JOIN products AS p ON oi.product_id = p.product_id
GROUP BY p.product_id, p.product_name
ORDER BY profit_margin DESC;
- Most Returned Products Query the top 10 products by the number of returns. Challenge: Display the return rate as a percentage of total units sold for each product.
SELECT
p.product_id,
p.product_name,
COUNT(*) AS total_unit_sold,
SUM(CASE WHEN o.order_status = 'Returned' THEN 1 ELSE 0 END) AS total_returned,
ROUND(
SUM(CASE WHEN o.order_status = 'Returned' THEN 1 ELSE 0 END)::numeric
/ NULLIF(COUNT(*), 0)::numeric,
2
) * 100 AS perc_tot_ret
FROM order_items AS oi
JOIN products AS p ON oi.product_id = p.product_id
JOIN orders AS o ON oi.order_id = o.order_id
GROUP BY p.product_id, p.product_name
ORDER BY perc_tot_ret DESC
LIMIT 10;
- Inactive Sellers Identify sellers who haven’t made any sales in the last 6 months. Challenge: Show the last sale date and total sales from those sellers.
WITH cte1 AS (
SELECT
s.seller_id,
s.seller_name
FROM sellers AS s
WHERE NOT EXISTS (
SELECT 1
FROM orders AS o
WHERE o.seller_id = s.seller_id
AND o.order_date >= CURRENT_DATE - INTERVAL '6 month'
)
)
SELECT
cte1.seller_id,
cte1.seller_name,
MAX(o.order_date) AS last_sale_date,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sales
FROM cte1
JOIN orders AS o
ON cte1.seller_id = o.seller_id
AND o.order_date < CURRENT_DATE - INTERVAL '6 month'
JOIN order_items AS oi
ON o.order_id = oi.order_id
GROUP BY cte1.seller_id, cte1.seller_name
HAVING SUM(oi.total_sale) > 0
ORDER BY last_sale_date DESC;
- IDENTITY customers into returning or new if the customer has done more than 5 return categorize them as returning otherwise new Challenge: List customers id, name, total orders, total returns
SELECT
cte.customer_id,
cte.full_name AS customers,
cte.total_orders,
cte.total_return,
CASE WHEN cte.total_return > 5 THEN 'Returning customer' ELSE 'NEW' END AS customer_category
FROM (
SELECT
c.customer_id,
CONCAT(c.first_name, ' ', c.last_name) AS full_name,
COUNT(o.order_id) AS total_orders,
SUM(CASE WHEN o.order_status = 'Returned' THEN 1 ELSE 0 END) AS total_return
FROM customers AS c
JOIN orders AS o ON c.customer_id = o.customer_id
JOIN shippings AS s ON o.order_id = s.order_id -- Fixed JOIN condition
GROUP BY c.customer_id, full_name
) AS cte;
- Top 5 Customers by Orders in Each State Identify the top 5 customers with the highest number of orders for each state. Challenge: Include the number of orders and total sales for each customer.
SELECT *
FROM (
SELECT
c.state,
CONCAT(c.first_name, ' ', c.last_name) AS full_name,
COUNT(o.order_id) AS total_orders,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sale,
DENSE_RANK() OVER(PARTITION BY c.state ORDER BY COUNT(o.order_id) DESC) AS rank
FROM orders AS o
JOIN order_items AS oi ON o.order_id = oi.order_id
JOIN customers AS c ON o.customer_id = c.customer_id
GROUP BY c.state, full_name
) AS ranked_customers
WHERE rank <= 5;
- Revenue by Shipping Provider Calculate the total revenue handled by each shipping provider. Challenge: Include the total number of orders handled and the average delivery time for each provider.
SELECT
s.shipping_providers,
COUNT(o.order_id) AS order_handled,
ROUND(SUM(oi.total_sale)::numeric, 2) AS total_sale,
COALESCE(ROUND(AVG(s.return_date - s.shipping_date)::numeric, 2), 0) AS average_days
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN shippings AS s ON s.order_id = o.order_id
GROUP BY s.shipping_providers;
- Top 10 product with highest decreasing revenue ratio compare to year 2022 and year 2023 Challenge: Return product_id, product_name, category_name, 2022 revenue and 2023 revenue decrease ratio at end Round the result Note: Decrease ratio = y23-y22/y22* 100
WITH year_2022 AS (
SELECT
p.product_id,
p.product_name,
c.category_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN products AS p ON p.product_id = oi.product_id
JOIN category AS c ON c.category_id = p.category_id
WHERE EXTRACT(YEAR FROM o.order_date) = 2022
GROUP BY p.product_id, p.product_name, c.category_name
),
year_2023 AS (
SELECT
p.product_id,
p.product_name,
c.category_name,
ROUND(SUM(oi.total_sale)::numeric, 2) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN products AS p ON p.product_id = oi.product_id
JOIN category AS c ON c.category_id = p.category_id
WHERE EXTRACT(YEAR FROM o.order_date) = 2023
GROUP BY p.product_id, p.product_name, c.category_name
)
SELECT
y22.product_id,
y22.product_name,
y22.category_name,
y22.revenue AS revenues_22,
y23.revenue AS revenues_23,
ROUND(((y23.revenue - y22.revenue) / y22.revenue) * 100, 2) AS revenue_dec_ratio
FROM year_2022 AS y22
JOIN year_2023 AS y23 ON y22.product_id = y23.product_id
WHERE y22.revenue > y23.revenue
ORDER BY revenue_dec_ratio ASC
LIMIT 10;
This project enabled me to:
- Design and implement a normalized database schema.
- Clean and preprocess real-world datasets for analysis.
- Use advanced SQL techniques, including window functions, subqueries, and joins.
- Conduct in-depth business analysis using SQL.
- Optimize query performance and handle large datasets efficiently.
This advanced SQL project successfully demonstrates my ability to solve real-world e-commerce problems using structured queries. From improving customer retention to optimizing inventory and logistics, the project provides valuable insights into operational challenges and solutions.
By completing this project, I have gained a deeper understanding of how SQL can be used to tackle complex data problems and drive business decision-making.