## Spark SQL Exploratory Data Analysis (EDA) Project

Overview
This script contains a collection of Spark SQL queries designed to perform exploratory data analysis (EDA) on a sales database within the Databricks environment. The queries leverage Spark's distributed computing to uncover key insights about customers, products, sales, and orders, providing a comprehensive understanding of the business's performance and trends.

Features
- **Database Schema Exploration**: Queries that list all tables and columns through the catalog's `information_schema` views.
- **Customer Analysis**:
  - Identify unique countries of customers.
  - Calculate the total number of customers and their distribution by country and gender.
  - Determine the youngest and oldest customers based on birthdate.
- **Product Analysis**:
  - List all product categories, subcategories, and product names.
  - Calculate the total number of products and their distribution by category.
  - Compute average product costs per category.
- **Sales Analysis**:
  - Determine the date range of sales and the number of years covered.
  - Calculate total sales, total quantity sold, average selling price, and total orders using Spark's aggregation functions.
  - Analyze revenue by product category and customer.
  - Identify the distribution of sold items across countries.
- **Rankings and Performance**:
  - Rank the top 5 products by revenue using both simple and Spark SQL window function-based approaches.
  - Identify the 5 worst-performing products by sales.
  - List the top 10 customers by revenue and the 3 customers with the fewest orders.

Database Schema
The queries assume Delta tables in the `gold` schema of the `dwh_project` catalog, with the following key tables:
- `dim_customers`: Contains customer information (e.g., customer_key, first_name, last_name, country, gender, birthdate).
- `dim_products`: Contains product details (e.g., product_key, product_name, category, subcategory, cost).
- `fact_sales`: Contains sales transactions (e.g., order_number, order_date, customer_key, product_key, sales_amount, quantity, price).

Usage
The queries are written in Spark SQL for execution on Databricks with Delta Lake storage. They use functions such as `DATEDIFF`, `SUM`, `COUNT`, and `AVG`, plus the `RANK` window function.
Ensure the Delta Lake contains the required tables (`gold.dim_customers`, `gold.dim_products`, `gold.fact_sales`) with the specified columns.
Run the queries directly in a Databricks notebook using the `%sql` magic command or via PySpark with `spark.sql()` and `display()`.
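
Before running the cells, it can help to confirm that the three `gold` tables expose the columns listed above. The following is a minimal sketch of such a check, not part of the original notebook: `REQUIRED_COLUMNS`, `missing_columns`, and `fetch_actual_columns` are hypothetical names, and the Spark lookup assumes the `dwh_project` catalog and the `information_schema.columns` fields (`table_schema`, `table_name`, `column_name`) that appear in the notebook's own output.

```python
# Columns each gold table is expected to expose, taken from the schema list above.
REQUIRED_COLUMNS = {
    "dim_customers": {"customer_key", "first_name", "last_name", "country", "gender", "birthdate"},
    "dim_products": {"product_key", "product_name", "category", "subcategory", "cost"},
    "fact_sales": {"order_number", "order_date", "customer_key", "product_key",
                   "sales_amount", "quantity", "price"},
}


def missing_columns(actual, required=REQUIRED_COLUMNS):
    """Return {table: sorted missing columns} for tables that lack required columns.

    `actual` maps table name -> iterable of column names found in the catalog.
    """
    gaps = {}
    for table, needed in required.items():
        found = {c.lower() for c in actual.get(table, ())}
        absent = sorted(needed - found)
        if absent:
            gaps[table] = absent
    return gaps


def fetch_actual_columns(spark, catalog="dwh_project", schema="gold"):
    """Hypothetical helper: read column names from the catalog's information_schema.

    Only meaningful inside a Databricks/Spark session; it is not called in this sketch.
    """
    rows = spark.sql(
        f"SELECT table_name, column_name "
        f"FROM {catalog}.information_schema.columns "
        f"WHERE table_schema = '{schema}'"
    ).collect()
    actual = {}
    for row in rows:
        actual.setdefault(row["table_name"], set()).add(row["column_name"])
    return actual


# Offline example: dim_products lacks `cost`, and fact_sales is absent entirely.
example = {
    "dim_customers": ["customer_key", "first_name", "last_name", "country", "gender", "birthdate"],
    "dim_products": ["product_key", "product_name", "category", "subcategory"],
}
gaps = missing_columns(example)
print(gaps)
```

In a notebook, `missing_columns(fetch_actual_columns(spark))` would return an empty dict when every required column is present.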

Key Queries
1. **Schema Exploration**: Retrieve all tables and columns in the Delta Lake using Spark SQL metadata.
2. **Customer Insights**: Analyze customer demographics (countries, gender, age) with Spark's distributed queries.
3. **Product Insights**: Explore product categories, subcategories, and costs using Spark SQL aggregations.
4. **Sales Metrics**: Calculate total sales, quantities, average prices, and order counts with Spark's optimization.
5. **Revenue Analysis**: Break down revenue by category, customer, and country using distributed joins and aggregations.
6. **Performance Rankings**: Identify top and bottom-performing products and customers with Spark SQL window functions.

Notes
The queries are designed for a clean and normalized Delta Lake. Ensure data integrity (e.g., no missing or null values in key columns) for accurate results.
Some queries use `LEFT JOIN` to handle potential missing relationships between tables, optimized for Spark's execution engine.
Date calculations use `DATEDIFF` and `CURRENT_DATE`, both available in Spark SQL. Substitute other date functions (e.g., `EXTRACT`) if a different granularity is needed.
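
The age and date-range cells divide a `DATEDIFF` day count by 365.25, which approximates calendar years rather than computing them exactly. The plain-Python sketch below (no Spark involved; `approx_years` and `exact_years` are hypothetical helper names) illustrates where the two can disagree.

```python
import math
from datetime import date


def approx_years(start, end):
    """Mirror FLOOR(DATEDIFF(end, start) / 365.25) from the notebook's age query."""
    return math.floor((end - start).days / 365.25)


def exact_years(start, end):
    """Whole calendar years elapsed: subtract one if the anniversary has not occurred yet."""
    before_anniversary = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - (1 if before_anniversary else 0)


# Away from an anniversary, the two calculations agree.
print(approx_years(date(1986, 6, 25), date(2025, 9, 1)),
      exact_years(date(1986, 6, 25), date(2025, 9, 1)))

# Exactly on a first anniversary the approximation can lag by one year,
# because 365 days / 365.25 is just under 1.
print(approx_years(date(2001, 1, 1), date(2002, 1, 1)),
      exact_years(date(2001, 1, 1), date(2002, 1, 1)))
```

For exploratory work the approximation is usually adequate. If exact ages matter, an expression based on Spark SQL's `MONTHS_BETWEEN`, such as `FLOOR(MONTHS_BETWEEN(CURRENT_DATE, birthdate) / 12)`, is one alternative worth testing against the data.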

License
This project is licensed under the MIT License. Feel free to use, modify, and distribute the queries as needed.

In [0]:
## Explore All objects in the DATABASE

df = spark.sql("""SELECT * FROM dwh_project.information_schema.TABLES""")
display(df)

table_catalog,table_schema,table_name,table_type,is_insertable_into,commit_action,table_owner,comment,created,created_by,last_altered,last_altered_by,data_source_format,storage_sub_directory,storage_path
dwh_project,information_schema,table_tags,VIEW,NO,PRESERVE,System user,,2025-08-29T10:26:07.141Z,System user,2025-08-29T10:26:07.141Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,silver,crm_sales_details,MANAGED,YES,PRESERVE,sergitkeshelashvili@gmail.com,,2025-08-31T18:36:21.145Z,sergitkeshelashvili@gmail.com,2025-08-31T18:36:21.911Z,sergitkeshelashvili@gmail.com,DELTA,tables/103bdec7-2c59-4ec1-a624-5cde1872ecc3,
dwh_project,information_schema,columns,VIEW,NO,PRESERVE,System user,Describes columns of tables and views in the catalog.,2025-08-29T10:26:06.690Z,System user,2025-08-29T10:26:06.690Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,information_schema,volume_privileges,VIEW,NO,PRESERVE,System user,,2025-08-29T10:26:07.084Z,System user,2025-08-29T10:26:07.084Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,information_schema,schemata,VIEW,NO,PRESERVE,System user,Describes schemas within the catalog.,2025-08-29T10:26:06.857Z,System user,2025-08-29T10:26:06.857Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,bronze,erp_loc_a101,MANAGED,YES,PRESERVE,sergitkeshelashvili@gmail.com,,2025-08-29T10:54:37.335Z,sergitkeshelashvili@gmail.com,2025-08-29T10:54:37.989Z,sergitkeshelashvili@gmail.com,DELTA,tables/25cf89e3-f05f-49e2-9d70-36ee72dce497,
dwh_project,information_schema,catalog_privileges,VIEW,NO,PRESERVE,System user,Lists principals that have privileges on the catalogs.,2025-08-29T10:26:06.630Z,System user,2025-08-29T10:26:06.630Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,information_schema,volume_tags,VIEW,NO,PRESERVE,System user,,2025-08-29T10:26:07.179Z,System user,2025-08-29T10:26:07.179Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,information_schema,tables,VIEW,NO,PRESERVE,System user,Describes tables and views defined within the catalog.,2025-08-29T10:26:06.984Z,System user,2025-08-29T10:26:06.984Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,
dwh_project,information_schema,constraint_column_usage,VIEW,NO,PRESERVE,System user,Describes the constraints referencing columns in the catalog.,2025-08-29T10:26:06.921Z,System user,2025-08-29T10:26:06.921Z,System user,UNKNOWN_DATA_SOURCE_FORMAT,,


In [0]:

## Explore All Columns in the Database

df = spark.sql("""SELECT * FROM dwh_project.information_schema.COLUMNS""")
display(df)

table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,full_data_type,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,maximum_cardinality,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_system_time_period_start,is_system_time_period_end,system_time_period_timestamp_generation,is_updatable,partition_index,comment
dwh_project,information_schema,catalog_privileges,grantor,0,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,Principal that granted the privilege.
dwh_project,information_schema,catalog_privileges,grantee,1,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,Principal to which the privilege is granted.
dwh_project,information_schema,catalog_privileges,catalog_name,2,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,Catalog on which the privilege is granted.
dwh_project,information_schema,catalog_privileges,privilege_type,3,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,Privilege being granted.
dwh_project,information_schema,catalog_privileges,is_grantable,4,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,Always NO. Reserved for future use.
dwh_project,information_schema,catalog_privileges,inherited_from,5,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,The ancestor relation that the privilege is inherited from.
dwh_project,information_schema,catalog_tags,catalog_name,0,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,
dwh_project,information_schema,catalog_tags,tag_name,1,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,
dwh_project,information_schema,catalog_tags,tag_value,2,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,
dwh_project,information_schema,catalogs,catalog_name,0,,NO,string,STRING,0.0,0.0,,,,,,,,NO,,,,,,,NO,,NO,NO,,YES,,Name of the catalog.


In [0]:
df = spark.sql("""
SELECT DISTINCT country FROM dwh_project.gold.dim_customers""")
display(df)

country
""
Germany
United States
United Kingdom
Canada
France
Australia
""


In [0]:
## Dimensions Exploration
## Explore All Countries customers come from

df = spark.sql("""SELECT DISTINCT country FROM dwh_project.gold.dim_customers""").dropna()
display(df)

country
Germany
United States
United Kingdom
Canada
France
Australia
""


In [0]:
## Explore All Categories & subcategories & products

df = spark.sql("""SELECT DISTINCT category, subcategory, product_name FROM dwh_project.gold.dim_products
ORDER BY 1,2,3""").dropna()
display(df)

category,subcategory,product_name
Accessories,Bike Racks,Hitch Rack - 4-Bike
Accessories,Bike Stands,All-Purpose Bike Stand
Accessories,Bottles and Cages,Mountain Bottle Cage
Accessories,Bottles and Cages,Road Bottle Cage
Accessories,Bottles and Cages,Water Bottle - 30 oz.
Accessories,Cleaners,Bike Wash - Dissolver
Accessories,Fenders,Fender Set - Mountain
Accessories,Helmets,Sport-100 Helmet- Black
Accessories,Helmets,Sport-100 Helmet- Blue
Accessories,Helmets,Sport-100 Helmet- Red


In [0]:
## Date Exploration
## Find the date of the first and last order
## How many years of sales are available

df = spark.sql("""SELECT
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date,
    ROUND(DATEDIFF(MAX(order_date), MIN(order_date)) / 365.25, 2) AS order_range_years
FROM dwh_project.gold.fact_sales""")
display(df)

first_order_date,last_order_date,order_range_years
2010-12-29,2014-01-28,3.08


In [0]:
## Find the youngest and the oldest customer

df = spark.sql("""SELECT
    MIN(birthdate) AS oldest_birthdate,
    FLOOR(DATEDIFF(CURRENT_DATE, MIN(birthdate)) / 365.25) AS oldest_age,
    MAX(birthdate) AS youngest_birthdate,
    FLOOR(DATEDIFF(CURRENT_DATE, MAX(birthdate)) / 365.25) AS youngest_age
FROM dwh_project.gold.dim_customers""")
display(df)

oldest_birthdate,oldest_age,youngest_birthdate,youngest_age
1916-02-10,109,1986-06-25,39


In [0]:
## Find the Total sales

df = spark.sql("""SELECT 
	SUM(sales_amount) AS total_sales
FROM dwh_project.gold.fact_sales""")
display(df)

total_sales
29356250


In [0]:

## Find how many items are sold

df = spark.sql("""SELECT 
	SUM(quantity) AS total_quantity
FROM  dwh_project.gold.fact_sales""")
display(df)


total_quantity
60423


In [0]:

## Find the average selling price

df = spark.sql("""SELECT
	ROUND(AVG(price),2) AS avg_price
FROM dwh_project.gold.fact_sales""")
display(df)

avg_price
486.04


In [0]:
## Find the total number of orders

df = spark.sql("""SELECT 
	COUNT(DISTINCT order_number) AS total_orders
FROM  dwh_project.gold.fact_sales""")
display(df)

total_orders
27659


In [0]:
## Find the total number of products

df = spark.sql("""SELECT 
	COUNT(DISTINCT product_name) AS total_products
FROM dwh_project.gold.dim_products""")
display(df)

total_products
295


In [0]:
## Find the total number of customers

df = spark.sql("""SELECT 
    COUNT(customer_key) AS total_customers
FROM dwh_project.gold.dim_customers""")
display(df)

total_customers
18485


In [0]:
## Find the total number of customers who have placed an order

df = spark.sql("""SELECT 
	COUNT(DISTINCT customer_key) AS total_customers
FROM dwh_project.gold.fact_sales""")
display(df)

total_customers
18484
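
The two counts differ by one (18485 customers in the dimension, 18484 with at least one sale), which suggests one customer has no orders. The sketch below shows one way to list such customers with a `LEFT JOIN ... IS NULL` anti-join. It runs on Python's built-in `sqlite3` as a stand-in for Spark, uses a tiny made-up dataset, and drops the `dwh_project.gold` prefix; the same query shape should carry over to Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (customer_key INTEGER, first_name TEXT);
CREATE TABLE fact_sales (order_number TEXT, customer_key INTEGER);
INSERT INTO dim_customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO fact_sales VALUES ('SO1', 1), ('SO2', 1), ('SO3', 3);
""")

# Keep only customers with no matching row in fact_sales (an anti-join).
no_orders = conn.execute("""
SELECT c.customer_key, c.first_name
FROM dim_customers c
LEFT JOIN fact_sales f ON f.customer_key = c.customer_key
WHERE f.customer_key IS NULL
ORDER BY c.customer_key
""").fetchall()

print(no_orders)
conn.close()
```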


In [0]:
## Generate a Report that shows all key metrics of the business

df = spark.sql("""
SELECT 'Total Sales' AS measure_name, SUM(sales_amount) AS measure_value FROM dwh_project.gold.fact_sales
UNION ALL
SELECT 'Total Quantity', SUM(quantity) FROM dwh_project.gold.fact_sales
UNION ALL
SELECT 'Average Price', ROUND(AVG(price), 2) FROM dwh_project.gold.fact_sales
UNION ALL
SELECT 'Total Orders', COUNT(DISTINCT order_number) FROM dwh_project.gold.fact_sales
UNION ALL
SELECT 'Total Products', COUNT(DISTINCT product_name) FROM dwh_project.gold.dim_products
UNION ALL
SELECT 'Total Customers', COUNT(customer_key) FROM dwh_project.gold.dim_customers
""")
display(df)

measure_name,measure_value
Total Sales,29356250.0
Total Quantity,60423.0
Average Price,486.04
Total Orders,27659.0
Total Products,295.0
Total Customers,18485.0
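
The report references `fact_sales` in four separate `UNION ALL` branches. When the long (name, value) layout is not required, the four fact-table metrics can be computed in a single `SELECT`. The sketch below illustrates that shape on Python's built-in `sqlite3` with made-up rows; it is a stand-in for Spark, not the notebook's own query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (order_number TEXT, sales_amount INTEGER, quantity INTEGER, price INTEGER);
INSERT INTO fact_sales VALUES
  ('SO1', 100, 1, 100),
  ('SO1',  40, 2,  20),
  ('SO2',  30, 1,  30);
""")

# One pass over the table returns all four metrics as columns of a single row.
total_sales, total_quantity, avg_price, total_orders = conn.execute("""
SELECT
    SUM(sales_amount),
    SUM(quantity),
    ROUND(AVG(price), 2),
    COUNT(DISTINCT order_number)
FROM fact_sales
""").fetchone()

print(total_sales, total_quantity, avg_price, total_orders)
conn.close()
```

A wide row also keeps each metric in its own column type, whereas the `UNION ALL` layout places every measure in one shared `measure_value` column.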


In [0]:
## Find total customers by countries

df = spark.sql("""SELECT
    country,
    COUNT(customer_key) AS total_customers
FROM dwh_project.gold.dim_customers
GROUP BY country
ORDER BY total_customers DESC""").dropna()
display(df)

country,total_customers
United States,7482
Australia,3591
United Kingdom,1913
France,1810
Germany,1780
Canada,1571
,337
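
The country breakdown still shows a blank country (337 customers) even though `.dropna()` is applied. One possible explanation, stated here as an assumption, is that those values are stored as empty or whitespace strings rather than nulls, which `dropna()` does not remove. The plain-Python sketch below illustrates the difference; `drop_nulls` and `drop_blank` are hypothetical helpers.

```python
def drop_nulls(values):
    """Analogue of dropna() on a single string column: removes only None (null)."""
    return [v for v in values if v is not None]


def drop_blank(values):
    """Also removes empty or whitespace-only strings."""
    return [v for v in values if v is not None and v.strip() != ""]


# Assumed sample: one null and one empty string alongside real countries.
countries = ["Germany", None, "France", "", "Canada"]

print(drop_nulls(countries))
print(drop_blank(countries))
```

In Spark SQL, a filter such as `WHERE country IS NOT NULL AND TRIM(country) <> ''` would express the second behavior directly in the query.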


In [0]:
## Find total customers by gender

df = spark.sql("""SELECT
    gender,
    COUNT(customer_key) AS total_customers
FROM dwh_project.gold.dim_customers
GROUP BY gender
ORDER BY total_customers DESC""")
display(df)

gender,total_customers
Male,9341
Female,9128
,16


In [0]:
## Find total products by category

df = spark.sql("""SELECT
    category,
    COUNT(product_key) AS total_products
FROM dwh_project.gold.dim_products
GROUP BY category
ORDER BY total_products DESC""").dropna()
display(df)

category,total_products
Components,127
Bikes,97
Clothing,35
Accessories,29


In [0]:
## Average costs in each category

df = spark.sql("""
SELECT
category,
ROUND(AVG(cost), 2) AS avg_cost
FROM dwh_project.gold.dim_products
GROUP BY category
ORDER BY avg_cost DESC
""").dropna()
display(df)

category,avg_cost
Bikes,949.44
Components,264.72
Clothing,24.8
Accessories,13.17


In [0]:
## Total revenue generated for each category

df = spark.sql("""
SELECT
p.category,
SUM(f.sales_amount) AS total_revenue
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_products p
ON p.product_key = f.product_key
GROUP BY p.category
ORDER BY total_revenue DESC
""")
display(df)

category,total_revenue
Bikes,28316272
Accessories,700262
Clothing,339716


In [0]:
## Total revenue generated by each customer

df = spark.sql("""
SELECT
c.customer_key,
c.first_name,
c.last_name,
SUM(f.sales_amount) AS total_revenue
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_customers c
ON c.customer_key = f.customer_key
GROUP BY
c.customer_key,
c.first_name,
c.last_name
ORDER BY total_revenue DESC
""")
display(df)

customer_key,first_name,last_name,total_revenue
1303,Nichole,Nara,13294
1134,Kaitlyn,Henderson,13294
1310,Margaret,He,13268
1133,Randall,Dominguez,13265
1302,Adriana,Gonzalez,13242
1323,Rosa,Hu,13215
1126,Brandi,Gill,13195
1309,Brad,She,13172
1298,Francisco,Sara,13164
435,Maurice,Shan,12914


In [0]:
## The distribution of sold items across countries

df = spark.sql("""
SELECT
c.country,
SUM(f.quantity) AS total_sold_items
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_customers c
ON c.customer_key = f.customer_key
GROUP BY c.country
ORDER BY total_sold_items DESC
""")
display(df)

country,total_sold_items
United States,20481
Australia,13346
Canada,7630
United Kingdom,6910
Germany,5626
France,5559
,871


In [0]:
## Which 5 products generate the highest revenue?
## Simple Ranking

df = spark.sql("""
SELECT
p.product_name,
SUM(f.sales_amount) AS total_revenue
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_products p
ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY total_revenue DESC
LIMIT 5
""")
display(df)

product_name,total_revenue
Mountain-200 Black- 46,1373454
Mountain-200 Black- 42,1363128
Mountain-200 Silver- 38,1339394
Mountain-200 Silver- 46,1301029
Mountain-200 Black- 38,1294854


In [0]:

## More complex but flexible ranking using window functions

df = spark.sql("""
SELECT *
FROM (
SELECT
p.product_name,
SUM(f.sales_amount) AS total_revenue,
RANK() OVER (ORDER BY SUM(f.sales_amount) DESC) AS rank_products
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_products p
ON p.product_key = f.product_key
GROUP BY p.product_name
) AS ranked_products
WHERE rank_products <= 5
""")
display(df)

product_name,total_revenue,rank_products
Mountain-200 Black- 46,1373454,1
Mountain-200 Black- 42,1363128,2
Mountain-200 Silver- 38,1339394,3
Mountain-200 Silver- 46,1301029,4
Mountain-200 Black- 38,1294854,5
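
On this data the `LIMIT 5` and `RANK()` versions return the same rows, but they can differ when revenues tie at the cutoff: `LIMIT` keeps a fixed number of rows, while `RANK() <= n` keeps every row sharing a qualifying rank. The sketch below illustrates the difference on Python's built-in `sqlite3` (window functions need SQLite 3.25 or later) with a made-up, pre-aggregated `product_revenue` table; it is a stand-in for Spark, not the notebook's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_revenue (product_name TEXT, total_revenue INTEGER);
INSERT INTO product_revenue VALUES ('A', 500), ('B', 400), ('C', 400), ('D', 100);
""")

# LIMIT cuts at a fixed row count, so one of the two tied products is dropped.
top_by_limit = conn.execute("""
SELECT product_name
FROM product_revenue
ORDER BY total_revenue DESC, product_name
LIMIT 2
""").fetchall()

# RANK() gives tied rows the same rank, so both tied products are kept.
top_by_rank = conn.execute("""
SELECT product_name, rank_products
FROM (
    SELECT product_name,
           RANK() OVER (ORDER BY total_revenue DESC) AS rank_products
    FROM product_revenue
) AS ranked
WHERE rank_products <= 2
ORDER BY rank_products, product_name
""").fetchall()

print(top_by_limit)
print(top_by_rank)
conn.close()
```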


In [0]:
## What are the 5 worst-performing products in terms of sales?

df = spark.sql("""
SELECT
p.product_name,
SUM(f.sales_amount) AS total_revenue
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_products p
ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY total_revenue
LIMIT 5
""")
display(df)

product_name,total_revenue
Racing Socks- L,2430
Racing Socks- M,2682
Patch Kit/8 Patches,6382
Bike Wash - Dissolver,7272
Touring Tire Tube,7440


In [0]:

## Find the top 10 customers who have generated the highest revenue

df = spark.sql("""
SELECT
c.customer_key,
c.first_name,
c.last_name,
SUM(f.sales_amount) AS total_revenue
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_customers c
ON c.customer_key = f.customer_key
GROUP BY
c.customer_key,
c.first_name,
c.last_name
ORDER BY total_revenue DESC
LIMIT 10
""")
display(df)

customer_key,first_name,last_name,total_revenue
1134,Kaitlyn,Henderson,13294
1303,Nichole,Nara,13294
1310,Margaret,He,13268
1133,Randall,Dominguez,13265
1302,Adriana,Gonzalez,13242
1323,Rosa,Hu,13215
1126,Brandi,Gill,13195
1309,Brad,She,13172
1298,Francisco,Sara,13164
435,Maurice,Shan,12914
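
Customers 1134 and 1303 tie at 13294, and they appear in a different order here than in the full per-customer listing earlier in the notebook. With `ORDER BY total_revenue DESC` alone, the order of tied rows is not guaranteed. The plain-Python sketch below shows the effect of adding a tie-breaker, the analogue of `ORDER BY total_revenue DESC, customer_key`; choosing `customer_key` as the secondary key is an assumption made for illustration.

```python
# Rows copied from the notebook output: (customer_key, first_name, last_name, total_revenue).
rows = [
    (1303, "Nichole", "Nara", 13294),
    (1134, "Kaitlyn", "Henderson", 13294),
    (1310, "Margaret", "He", 13268),
]

# Sort by revenue descending, then customer_key ascending as a tie-breaker,
# so tied customers always come out in the same order.
ordered = sorted(rows, key=lambda r: (-r[3], r[0]))
print([r[0] for r in ordered])
```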


In [0]:

## The 3 customers with the fewest orders placed

df = spark.sql("""
SELECT
c.customer_key,
c.first_name,
c.last_name,
COUNT(DISTINCT order_number) AS total_orders
FROM dwh_project.gold.fact_sales f
LEFT JOIN dwh_project.gold.dim_customers c
ON c.customer_key = f.customer_key
GROUP BY
c.customer_key,
c.first_name,
c.last_name
ORDER BY total_orders
LIMIT 3
""")
display(df)

customer_key,first_name,last_name,total_orders
16820,Candice,Hu,1
17951,Lance,Ramos,1
15481,Katherine,Cook,1
