Skip to content

sorol25/PythonSQL_Ecommerce_Project

Repository files navigation


πŸ“Œ Project Overview

Real-world Brazilian e-commerce data, analyzed end-to-end β€” from loading 7 raw CSV files into a relational MySQL database, all the way through 15 progressively complex business intelligence queries.

This project demonstrates a production-style Python β†’ SQL analytics pipeline: automated schema inference, bulk data ingestion, and layered business questions answered with clean SQL β€” from basic aggregations to advanced window functions.


πŸ—„οΈ Database Schema

Seven interconnected tables power all queries:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  customers  │────▢│    orders    │────▢│  order_items β”‚
│─────────────│     │──────────────│     │──────────────│
β”‚ customer_id β”‚     β”‚ order_id     β”‚     β”‚ order_id     β”‚
β”‚ customer_   β”‚     β”‚ customer_id  β”‚     β”‚ product_id   β”‚
β”‚   city      β”‚     β”‚ order_       β”‚     β”‚ seller_id    β”‚
β”‚ customer_   β”‚     β”‚  purchase_   β”‚     β”‚ price        β”‚
β”‚   state     β”‚     β”‚  timestamp   β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
                           β”‚                    β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   payments   β”‚     β”‚   products   β”‚
                    │──────────────│     │──────────────│
                    β”‚ order_id     β”‚     β”‚ product_id   β”‚
                    β”‚ payment_     β”‚     β”‚ product_     β”‚
                    β”‚   value      β”‚     β”‚  category    β”‚
                    β”‚ payment_     β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ installments β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚   sellers    β”‚        β”‚   geolocation    β”‚
     │──────────────│        │──────────────────│
     β”‚ seller_id    β”‚        β”‚ zip_code_prefix  β”‚
     β”‚ seller_city  β”‚        β”‚ lat / lng        β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“ Repository Structure

PythonSQL_Ecommerce_Project/
β”‚
β”œβ”€β”€ πŸ““ PythonSql_ecommerce.ipynb     # Main analysis notebook (EDA + all 15 SQL queries)
β”œβ”€β”€ 🐍 csv_to_sql.py                 # Automated CSV β†’ MySQL ingestion script
β”œβ”€β”€ customers.csv
β”œβ”€β”€ orders.csv
β”œβ”€β”€ order_items.csv
β”œβ”€β”€ payments.csv
β”œβ”€β”€ products.csv
β”œβ”€β”€ sellers.csv
β”œβ”€β”€ geolocation.csv
β”‚
└── πŸ“– README.md

πŸš€ Getting Started

Prerequisites

Python 3.8+  |  MySQL Server  |  Jupyter Notebook

Install required Python libraries:

pip install pandas mysql-connector-python matplotlib seaborn numpy

Step 1 β€” Clone the Repository

git clone https://github.com/sorol25/PythonSQL_Ecommerce_Project.git
cd PythonSQL_Ecommerce_Project

Step 2 β€” Set Up MySQL Database

CREATE DATABASE ecommerce;

Step 3 β€” Load CSV Data into MySQL

csv_to_sql.py handles everything automatically:

  • Infers SQL column types from pandas dtypes (INT, FLOAT, DATETIME, TEXT)
  • Creates all 7 tables dynamically with CREATE TABLE IF NOT EXISTS
  • Handles NaN β†’ NULL conversion cleanly before insert
  • Commits each table as an independent transaction

Update the data folder path, then run:

# In csv_to_sql.py β€” update this line to your local path:
folder_path = 'your/path/to/data/folder'
python csv_to_sql.py

Step 4 β€” Open the Analysis Notebook

jupyter notebook PythonSql_ecommerce.ipynb

Update the DB connection at the top:

db = mysql.connector.connect(
    host="localhost",
    username="root",
    password="your_password",
    database="ecommerce"
)

πŸ” Business Questions Solved

🟒 Basic β€” Foundations

# Question Highlight
1 Unique cities where customers are located 4,119 distinct cities
2 Total orders placed in 2017 45,101 orders
3 Total sales per product category 74 categories β€” top: Bed Table Bath at $1.71M
4 % of orders paid in installments ~99.99% of all transactions
5 Customer count by state SP dominates β€” visualized as bar chart

🟑 Intermediate β€” Business Aggregations

# Question SQL Technique
1 Orders per month in 2018 MONTHNAME() Β· GROUP BY Β· sns.barplot
2 Avg products per order, grouped by city CTE + AVG() + multi-table JOIN
3 Revenue % share per product category Inline subquery ratio calculation
4 Price vs. purchase frequency correlation np.corrcoef β†’ r = –0.106 (weak negative)
5 Revenue per seller, ranked DENSE_RANK() OVER(ORDER BY revenue DESC)

πŸ”΄ Advanced β€” Window Functions & Analytics

# Question SQL Technique
1 Moving average of order values per customer AVG() OVER(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
2 Cumulative monthly sales per year SUM() OVER(ORDER BY years, months)
3 Year-over-year revenue growth rate LAG() β†’ 2017: +12,112% Β· 2018: +20%
4 6-month customer retention rate Dual CTE + DATE_ADD() + LEFT JOIN
5 Top 3 spenders per year DENSE_RANK() OVER(PARTITION BY year ...)

πŸ“Š Key Findings at a Glance

πŸ’°  Total Platform Revenue (all time)   ~$16,008,872
πŸ“¦  Orders in 2017                       45,101
πŸ“ˆ  YoY Growth 2017 β†’ vs 2016           +12,112%  (platform launch ramp)
πŸ“ˆ  YoY Growth 2018 β†’ vs 2017           +20%
πŸ›οΈ   Top Revenue Category               Bed Table Bath β€” 10.70% of total
πŸ”  Orders Paid via Installments        ~99.99%
πŸ“‰  Price vs. Purchase Correlation      –0.106  (weak negative β€” cheaper items sell more)

πŸ’‘ Featured Queries

Year-over-Year Revenue Growth

WITH a AS (
  SELECT YEAR(orders.order_purchase_timestamp) AS years,
         ROUND(SUM(payments.payment_value), 2)  AS payment
  FROM orders JOIN payments ON orders.order_id = payments.order_id
  GROUP BY years ORDER BY years
)
SELECT years,
       ROUND(
         ((payment - LAG(payment,1) OVER(ORDER BY years)) /
           LAG(payment,1) OVER(ORDER BY years)) * 100
       , 2) AS yoy_growth_pct
FROM a;

Moving Average of Customer Order Values

SELECT customer_id, order_purchase_timestamp, payment,
  AVG(payment) OVER(
    PARTITION BY customer_id
    ORDER BY order_purchase_timestamp
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg
FROM (
  SELECT orders.customer_id, orders.order_purchase_timestamp,
         payments.payment_value AS payment
  FROM payments JOIN orders ON payments.order_id = orders.order_id
) AS a;

Top 3 Customers by Spend Per Year

SELECT years, customer_id, payment, d_rank FROM (
  SELECT YEAR(orders.order_purchase_timestamp) years,
         orders.customer_id,
         SUM(payments.payment_value) payment,
         DENSE_RANK() OVER(
           PARTITION BY YEAR(orders.order_purchase_timestamp)
           ORDER BY SUM(payments.payment_value) DESC
         ) d_rank
  FROM orders JOIN payments ON payments.order_id = orders.order_id
  GROUP BY YEAR(orders.order_purchase_timestamp), orders.customer_id
) AS a
WHERE d_rank <= 3;

πŸ“œ License

MIT License β€” Copyright (c) 2026 Yeamine Alam Sorol
Free to fork, star, and build upon.

πŸ‘¨β€πŸ’» About the Author

Yeamine Alam Sorol Data Analyst


Β 

If this project helped you, consider starring ⭐ the repo or sharing it with someone learning Data Analytics.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors