Real-world Brazilian e-commerce data, analyzed end-to-end β from loading 7 raw CSV files into a relational MySQL database, all the way through 15 progressively complex business intelligence queries.
This project demonstrates a production-style Python β SQL analytics pipeline: automated schema inference, bulk data ingestion, and layered business questions answered with clean SQL β from basic aggregations to advanced window functions.
Seven interconnected tables power all queries:
βββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β customers ββββββΆβ orders ββββββΆβ order_items β
βββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β customer_id β β order_id β β order_id β
β customer_ β β customer_id β β product_id β
β city β β order_ β β seller_id β
β customer_ β β purchase_ β β price β
β state β β timestamp β ββββββββββββββββ
βββββββββββββββ ββββββββββββββββ β
β βΌ
ββββββββββββββββ ββββββββββββββββ
β payments β β products β
ββββββββββββββββ ββββββββββββββββ
β order_id β β product_id β
β payment_ β β product_ β
β value β β category β
β payment_ β ββββββββββββββββ
β installments β
ββββββββββββββββ
ββββββββββββββββ ββββββββββββββββββββ
β sellers β β geolocation β
ββββββββββββββββ ββββββββββββββββββββ
β seller_id β β zip_code_prefix β
β seller_city β β lat / lng β
ββββββββββββββββ ββββββββββββββββββββ
PythonSQL_Ecommerce_Project/
β
βββ π PythonSql_ecommerce.ipynb # Main analysis notebook (EDA + all 15 SQL queries)
βββ π csv_to_sql.py # Automated CSV β MySQL ingestion script
βββ customers.csv
βββ orders.csv
βββ order_items.csv
βββ payments.csv
βββ products.csv
βββ sellers.csv
βββ geolocation.csv
β
βββ π README.md
Python 3.8+ | MySQL Server | Jupyter Notebook
Install required Python libraries:
pip install pandas mysql-connector-python matplotlib seaborn numpygit clone https://github.com/sorol25/PythonSQL_Ecommerce_Project.git
cd PythonSQL_Ecommerce_ProjectCREATE DATABASE ecommerce;csv_to_sql.py handles everything automatically:
- Infers SQL column types from pandas dtypes (
INT,FLOAT,DATETIME,TEXT) - Creates all 7 tables dynamically with
CREATE TABLE IF NOT EXISTS - Handles
NaN β NULLconversion cleanly before insert - Commits each table as an independent transaction
Update the data folder path, then run:
# In csv_to_sql.py β update this line to your local path:
folder_path = 'your/path/to/data/folder'python csv_to_sql.pyjupyter notebook PythonSql_ecommerce.ipynbUpdate the DB connection at the top:
db = mysql.connector.connect(
host="localhost",
username="root",
password="your_password",
database="ecommerce"
)| # | Question | Highlight |
|---|---|---|
| 1 | Unique cities where customers are located | 4,119 distinct cities |
| 2 | Total orders placed in 2017 | 45,101 orders |
| 3 | Total sales per product category | 74 categories β top: Bed Table Bath at $1.71M |
| 4 | % of orders paid in installments | ~99.99% of all transactions |
| 5 | Customer count by state | SP dominates β visualized as bar chart |
| # | Question | SQL Technique |
|---|---|---|
| 1 | Orders per month in 2018 | MONTHNAME() Β· GROUP BY Β· sns.barplot |
| 2 | Avg products per order, grouped by city | CTE + AVG() + multi-table JOIN |
| 3 | Revenue % share per product category | Inline subquery ratio calculation |
| 4 | Price vs. purchase frequency correlation | np.corrcoef β r = β0.106 (weak negative) |
| 5 | Revenue per seller, ranked | DENSE_RANK() OVER(ORDER BY revenue DESC) |
| # | Question | SQL Technique |
|---|---|---|
| 1 | Moving average of order values per customer | AVG() OVER(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) |
| 2 | Cumulative monthly sales per year | SUM() OVER(ORDER BY years, months) |
| 3 | Year-over-year revenue growth rate | LAG() β 2017: +12,112% Β· 2018: +20% |
| 4 | 6-month customer retention rate | Dual CTE + DATE_ADD() + LEFT JOIN |
| 5 | Top 3 spenders per year | DENSE_RANK() OVER(PARTITION BY year ...) |
π° Total Platform Revenue (all time) ~$16,008,872
π¦ Orders in 2017 45,101
π YoY Growth 2017 β vs 2016 +12,112% (platform launch ramp)
π YoY Growth 2018 β vs 2017 +20%
ποΈ Top Revenue Category Bed Table Bath β 10.70% of total
π Orders Paid via Installments ~99.99%
π Price vs. Purchase Correlation β0.106 (weak negative β cheaper items sell more)
Year-over-Year Revenue Growth
WITH a AS (
SELECT YEAR(orders.order_purchase_timestamp) AS years,
ROUND(SUM(payments.payment_value), 2) AS payment
FROM orders JOIN payments ON orders.order_id = payments.order_id
GROUP BY years ORDER BY years
)
SELECT years,
ROUND(
((payment - LAG(payment,1) OVER(ORDER BY years)) /
LAG(payment,1) OVER(ORDER BY years)) * 100
, 2) AS yoy_growth_pct
FROM a;Moving Average of Customer Order Values
SELECT customer_id, order_purchase_timestamp, payment,
AVG(payment) OVER(
PARTITION BY customer_id
ORDER BY order_purchase_timestamp
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM (
SELECT orders.customer_id, orders.order_purchase_timestamp,
payments.payment_value AS payment
FROM payments JOIN orders ON payments.order_id = orders.order_id
) AS a;Top 3 Customers by Spend Per Year
SELECT years, customer_id, payment, d_rank FROM (
SELECT YEAR(orders.order_purchase_timestamp) years,
orders.customer_id,
SUM(payments.payment_value) payment,
DENSE_RANK() OVER(
PARTITION BY YEAR(orders.order_purchase_timestamp)
ORDER BY SUM(payments.payment_value) DESC
) d_rank
FROM orders JOIN payments ON payments.order_id = orders.order_id
GROUP BY YEAR(orders.order_purchase_timestamp), orders.customer_id
) AS a
WHERE d_rank <= 3;MIT License β Copyright (c) 2026 Yeamine Alam Sorol
Free to fork, star, and build upon.