## Exploratory Data Analysis (EDA)
This section explores the `superstore_sales` dataset with SQL queries. Each query answers a specific business question, and together they surface patterns and trends that can inform decision-making.

### Key Questions Explored:
1. **Total Sales by Region**: Which regions generate the most sales?
2. **Top Product Sub-Categories**: What are the top-performing product sub-categories by sales?
3. **Top Customers**: Who are the top customers contributing to revenue?
4. **Sales Trends Over Time**: How have sales evolved monthly?
5. **Most Popular Ship Modes**: Which shipping modes are most frequently used?
6. **Sales by City**: Which cities contribute the most to sales?
7. **Delivery Times**: What are the average delivery times for different shipping modes?
8. **Sales by Customer Segment**: How do customer segments contribute to sales?
9. **Top Products**: Which products are the best sellers?
10. **Regional Sales Contribution**: What percentage of total sales does each region contribute?

Each query is followed by its result.


## Running Queries in pgAdmin

The SQL queries below run against the `superstore_sales` database. Before running them, set up the database as follows:

### Setup Instructions:
1. Run the `main.py` script to create and load the initial `sales_data` table into PostgreSQL.
2. Open the `DB_normalizing.ipynb` notebook and execute the database normalization queries in **pgAdmin** to normalize the database structure.

### Database Details:
- **Database Name**: `superstore_sales`
- **Host**: `localhost`
- **Port**: `5432`
- **Tables**:
  - `sales`: Contains transactional data and order details (e.g., sales, order dates, product IDs, customer IDs).
  - `customers`: Contains customer information (e.g., name, segment).
  - `products`: Contains product details (e.g., name, category).
  - `locations`: Stores geographical data (e.g., region, state, city).
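
To make the table relationships concrete, the sketch below builds a toy copy of the four tables in an in-memory SQLite database (a stand-in for PostgreSQL, so it runs without a server) and joins `sales` to `locations`. The column names are inferred from the queries in this notebook; the column types, the helper names `build_toy_db` and `sales_by_region`, and the sample rows are assumptions made for illustration and are not part of the dataset.

```python
import sqlite3

# Toy schema: column names follow the queries in this notebook,
# but the types and sample rows are assumptions for illustration only.
SCHEMA = """
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, customer_name TEXT, segment TEXT);
CREATE TABLE products  (product_id TEXT PRIMARY KEY, product_name TEXT,
                        category TEXT, sub_category TEXT);
CREATE TABLE locations (postal_code TEXT PRIMARY KEY, city TEXT, state TEXT, region TEXT);
CREATE TABLE sales (
    order_id TEXT, order_date TEXT, ship_date TEXT, ship_mode TEXT,
    customer_id TEXT REFERENCES customers(customer_id),
    product_id  TEXT REFERENCES products(product_id),
    postal_code TEXT REFERENCES locations(postal_code),
    sales REAL
);
"""


def build_toy_db():
    """Create the toy schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO locations VALUES (?, ?, ?, ?)", [
        ("90001", "Los Angeles", "California", "West"),
        ("10001", "New York City", "New York", "East"),
    ])
    conn.executemany(
        "INSERT INTO sales (order_id, postal_code, sales) VALUES (?, ?, ?)",
        [("A-1", "90001", 100.0), ("A-2", "90001", 50.0), ("B-1", "10001", 75.0)],
    )
    return conn


def sales_by_region(conn):
    """Total sales per region, joining sales to locations on postal_code."""
    return conn.execute("""
        SELECT l.region, SUM(s.sales) AS total_sales
        FROM sales s
        JOIN locations l ON s.postal_code = l.postal_code
        GROUP BY l.region
        ORDER BY total_sales DESC
    """).fetchall()


print(sales_by_region(build_toy_db()))
```

Each row in `sales` carries only the keys (`customer_id`, `product_id`, `postal_code`), so descriptive attributes such as region or segment always come from a join.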

### Query Execution:
Once the database setup is complete, run the provided SQL queries in **pgAdmin**.

### Steps to Execute Queries:
1. Open **pgAdmin** and connect to the `superstore_sales` database using the details above.
2. Run the SQL queries provided in this notebook.
3. Review the results and export them as needed for further analysis or visualization.
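
For running the same queries from a script instead of pgAdmin, a minimal sketch follows. `run_query` is a hypothetical helper, not part of this project; it should work with any Python DB-API connection. The demo uses the standard library's `sqlite3` with made-up rows so that it runs without a server. Against the actual database you would pass a PostgreSQL connection instead, for example one opened with a third-party driver such as `psycopg2` using the database name, host, and port listed above (an assumption about your environment).

```python
import sqlite3


def run_query(conn, sql):
    """Run a query on a DB-API connection and return rows as dicts (hypothetical helper)."""
    cur = conn.cursor()
    cur.execute(sql)
    columns = [col[0] for col in cur.description]  # column names of the result set
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    cur.close()
    return rows


# Demo on an in-memory SQLite database with made-up rows (not the real dataset).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (order_id TEXT, ship_mode TEXT, sales REAL)")
demo.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("A-1", "Standard Class", 20.0),
    ("A-2", "First Class", 35.5),
])

result = run_query(demo, "SELECT ship_mode, sales FROM sales ORDER BY sales DESC")
print(result)
```

Returning dicts keyed by column name keeps the output close to the result tables shown in this notebook.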

```sql
-- What is the total sales by region?
SELECT l.region, SUM(s.sales) AS total_sales
FROM sales s
JOIN locations l ON s.postal_code = l.postal_code
GROUP BY l.region
ORDER BY total_sales DESC;
```

| Region   | Total Sales   |
|----------|---------------|
| West     | 710,219.68    |
| East     | 669,518.73    |
| Central  | 492,646.91    |
| South    | 389,151.46    |


```sql
-- What are the top 5 product sub-categories by sales?
SELECT p.sub_category, SUM(s.sales) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.sub_category
ORDER BY total_sales DESC
LIMIT 5;
```

| Sub-Category | Total Sales   |
|--------------|---------------|
| Phones       | 327,782.45    |
| Chairs       | 322,822.73    |
| Storage      | 219,343.39    |
| Tables       | 202,810.63    |
| Binders      | 200,028.79    |

```sql
-- Who are the top 10 customers and their segment by total sales?
SELECT c.customer_name, c.segment, SUM(s.sales) AS total_sales
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.customer_name, c.segment
ORDER BY total_sales DESC
LIMIT 10;
```

| Customer Name          | Segment       | Total Sales   |
|------------------------|---------------|---------------|
| Sean Miller            | Home Office   | 25,043.05     |
| Tamara Chand           | Corporate     | 19,052.22     |
| Raymond Buch           | Consumer      | 15,117.34     |
| Tom Ashbrook           | Home Office   | 14,595.62     |
| Adrian Barton          | Consumer      | 14,473.57     |
| Ken Lonsdale           | Consumer      | 14,175.23     |
| Sanjit Chand           | Consumer      | 14,142.33     |
| Hunter Lopez           | Consumer      | 12,873.30     |
| Sanjit Engle           | Consumer      | 12,209.44     |
| Christopher Conant     | Consumer      | 12,129.07     |

```sql
-- What is the monthly trend in sales over time?
SELECT TO_CHAR(DATE_TRUNC('month', order_date), 'YYYY/MM') AS month,
       ROUND(SUM(sales), 2) AS total_sales
FROM sales
GROUP BY month
ORDER BY month;
```

![Monthly Trend in Sales Over Time](EDA%20graph_monthly%20trend%20in%20sales%20over%20time.png)

| Month    | Total Sales   |
|----------|---------------|
| 2015/01  | 14,205.71     |
| 2015/02  | 4,519.89      |
| 2015/03  | 55,205.80     |
| 2015/04  | 27,906.86     |
| 2015/05  | 23,644.30     |
| 2015/06  | 34,322.94     |
| 2015/07  | 33,781.54     |
| 2015/08  | 27,117.54     |
| 2015/09  | 81,623.53     |
| 2015/10  | 31,453.39     |
| 2015/11  | 77,907.66     |
| 2015/12  | 68,167.06     |
| 2016/01  | 18,066.96     |
| 2016/02  | 11,951.41     |
| 2016/03  | 32,339.32     |
| 2016/04  | 34,154.47     |
| 2016/05  | 29,959.53     |
| 2016/06  | 23,599.37     |
| 2016/07  | 28,608.26     |
| 2016/08  | 36,818.34     |
| 2016/09  | 63,133.61     |
| 2016/10  | 31,011.74     |
| 2016/11  | 75,249.40     |
| 2016/12  | 74,543.60     |
| 2017/01  | 18,542.49     |
| 2017/02  | 22,978.82     |
| 2017/03  | 51,165.06     |
| 2017/04  | 38,679.77     |
| 2017/05  | 56,656.91     |
| 2017/06  | 39,724.49     |
| 2017/07  | 38,320.78     |
| 2017/08  | 30,542.20     |
| 2017/09  | 69,193.39     |
| 2017/10  | 59,583.03     |
| 2017/11  | 79,066.50     |
| 2017/12  | 95,739.12     |
| 2018/01  | 43,476.47     |
| 2018/02  | 19,921.00     |
| 2018/03  | 58,863.41     |
| 2018/04  | 35,541.91     |
| 2018/05  | 43,825.98     |
| 2018/06  | 48,190.73     |
| 2018/07  | 44,825.10     |
| 2018/08  | 62,837.85     |
| 2018/09  | 86,152.89     |
| 2018/10  | 77,448.13     |
| 2018/11  | 117,938.16    |
| 2018/12  | 83,030.39     |
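
The seasonal pattern can be checked directly from the table. The sketch below copies the monthly totals above into Python and reports, for each year, the yearly total, the strongest month, and the three strongest months. The names `MONTHLY_SALES`, `peak_month`, and `top_months` are illustrative and not part of the project.

```python
# Monthly totals copied from the table above, January to December per year.
MONTHLY_SALES = {
    2015: [14205.71, 4519.89, 55205.80, 27906.86, 23644.30, 34322.94,
           33781.54, 27117.54, 81623.53, 31453.39, 77907.66, 68167.06],
    2016: [18066.96, 11951.41, 32339.32, 34154.47, 29959.53, 23599.37,
           28608.26, 36818.34, 63133.61, 31011.74, 75249.40, 74543.60],
    2017: [18542.49, 22978.82, 51165.06, 38679.77, 56656.91, 39724.49,
           38320.78, 30542.20, 69193.39, 59583.03, 79066.50, 95739.12],
    2018: [43476.47, 19921.00, 58863.41, 35541.91, 43825.98, 48190.73,
           44825.10, 62837.85, 86152.89, 77448.13, 117938.16, 83030.39],
}


def peak_month(values):
    """Return the 1-based month number with the highest total."""
    return max(range(12), key=lambda i: values[i]) + 1


def top_months(values, n=3):
    """Return the n strongest months (1-based), sorted by month number."""
    ranked = sorted(range(12), key=lambda i: values[i], reverse=True)
    return sorted(i + 1 for i in ranked[:n])


for year, values in MONTHLY_SALES.items():
    # year, yearly total, strongest month, three strongest months
    print(year, round(sum(values), 2), peak_month(values), top_months(values))
```

On these figures, September, November, and December are the three strongest months in every year, and 2018 has the highest yearly total.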

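```sql
-- Which ship mode is used most frequently?
-- Deduplicate to one row per order first, so each order is counted once.
SELECT ship_mode, COUNT(*) AS usage_count
FROM (
    SELECT DISTINCT order_id, ship_mode
    FROM sales
) AS orders  -- PostgreSQL versions before 16 require an alias on a subquery in FROM
GROUP BY ship_mode
ORDER BY usage_count DESC;
```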

| Ship Mode Name    | Usage Count |
|-------------------|-------------|
| Standard Class    | 2,945       |
| Second Class      | 944         |
| First Class       | 772         |
| Same Day          | 261         |
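
The `DISTINCT order_id, ship_mode` subquery matters when an order spans several rows in `sales`: counting rows directly would count line items, not orders. The sketch below shows the difference on made-up rows, using SQLite in place of PostgreSQL so that it runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id TEXT, ship_mode TEXT)")
# Made-up data: order A-1 has three line items, the other orders have one each.
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("A-1", "Same Day"), ("A-1", "Same Day"), ("A-1", "Same Day"),
    ("B-1", "First Class"), ("B-2", "First Class"),
])

# Counting rows directly counts line items.
per_row = dict(conn.execute(
    "SELECT ship_mode, COUNT(*) FROM sales GROUP BY ship_mode"
).fetchall())

# Deduplicating first counts each order once.
per_order = dict(conn.execute("""
    SELECT ship_mode, COUNT(*)
    FROM (SELECT DISTINCT order_id, ship_mode FROM sales) AS orders
    GROUP BY ship_mode
""").fetchall())

print("per row:  ", per_row)
print("per order:", per_order)
```

In the toy data, "Same Day" looks like the most used mode when rows are counted, but it was used for only one order.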

```sql
-- Which cities generate the highest sales?
SELECT
    l.city,
    ROUND(SUM(s.sales), 2) AS total_sales,
    RANK() OVER (ORDER BY SUM(s.sales) DESC) AS rank
FROM sales s
JOIN locations l ON s.postal_code = l.postal_code
GROUP BY l.city
ORDER BY rank
LIMIT 10;
```

| City             | Total Sales   | Rank |
|------------------|---------------|------|
| New York City    | 252,462.55    | 1    |
| Los Angeles      | 173,420.18    | 2    |
| Seattle          | 116,106.32    | 3    |
| San Francisco    | 109,041.12    | 4    |
| Philadelphia     | 108,841.75    | 5    |
| Houston          | 63,956.14     | 6    |
| San Diego        | 48,113.01     | 7    |
| Chicago          | 47,820.13     | 8    |
| Jacksonville     | 44,713.18     | 9    |
| Detroit          | 42,446.94     | 10   |
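
To put these figures in context, the combined share of the top 10 cities can be computed from this table and the regional totals reported by the first query. The sketch below assumes the four regional totals add up to overall sales, that is, every sale maps to exactly one region. The names are illustrative.

```python
# Top-10 city totals copied from the table above.
TOP_CITY_SALES = [252462.55, 173420.18, 116106.32, 109041.12, 108841.75,
                  63956.14, 48113.01, 47820.13, 44713.18, 42446.94]
# Regional totals (West, East, Central, South) from the first query's result.
# Assumption: together they cover all sales.
REGION_SALES = [710219.68, 669518.73, 492646.91, 389151.46]


def share_percent(part, whole):
    """Percentage that sum(part) represents of sum(whole), to one decimal place."""
    return round(100 * sum(part) / sum(whole), 1)


top10_share = share_percent(TOP_CITY_SALES, REGION_SALES)
print(top10_share)  # 44.5
```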

```sql
-- What is the average delivery time by ship mode?
-- Measured as the number of days between order_date and ship_date.
SELECT ship_mode,
       ROUND(AVG(ship_date - order_date), 2) AS avg_delivery_days
FROM sales
GROUP BY ship_mode
ORDER BY avg_delivery_days;
```

| Ship Mode Name    | Avg Delivery Days |
|-------------------|-------------------|
| Same Day          | 0.04             |
| First Class       | 2.18             |
| Second Class      | 3.25             |
| Standard Class    | 5.01             |

```sql
-- What is the sales breakdown by customer segment and year?
SELECT
    EXTRACT(YEAR FROM s.order_date) AS year,
    c.segment,
    ROUND(SUM(s.sales), 2) AS total_sales
FROM sales AS s
JOIN customers AS c ON s.customer_id = c.customer_id
GROUP BY year, c.segment
ORDER BY year, total_sales DESC;
```

| Year | Segment       | Total Sales   |
|------|---------------|---------------|
| 2015 | Consumer      | 262,956.80    |
| 2015 | Corporate     | 127,797.50    |
| 2015 | Home Office   | 89,101.91     |
| 2016 | Consumer      | 265,356.29    |
| 2016 | Corporate     | 119,675.60    |
| 2016 | Home Office   | 74,404.11     |
| 2017 | Consumer      | 291,142.97    |
| 2017 | Corporate     | 204,977.32    |
| 2017 | Home Office   | 104,072.27    |
| 2018 | Consumer      | 328,604.47    |
| 2018 | Corporate     | 236,043.66    |
| 2018 | Home Office   | 157,403.88    |

```sql
-- What are the top-selling products?
SELECT p.category,
       p.product_name,
       ROUND(SUM(s.sales), 2) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category, p.product_name
ORDER BY total_sales DESC
LIMIT 10;
```

| Category          | Product Name                                                           | Total Sales   |
|-------------------|------------------------------------------------------------------------|---------------|
| Technology        | Canon imageCLASS 2200 Advanced Copier                                  | 61,599.82     |
| Office Supplies   | Fellowes PB500 Electric Punch Plastic Comb Binding Machine with Manual Bind | 27,453.38     |
| Technology        | Cisco TelePresence System EX90 Videoconferencing Unit                 | 22,638.48     |
| Furniture         | HON 5400 Series Task Chairs for Big and Tall                          | 21,870.58     |
| Office Supplies   | GBC DocuBind TL300 Electric Binding System                            | 19,823.48     |
| Office Supplies   | GBC Ibimaster 500 Manual ProClick Binding System                      | 19,024.50     |
| Technology        | Hewlett Packard LaserJet 3310 Copier                                  | 18,839.69     |
| Technology        | HP Designjet T520 Inkjet Large Format Printer - 24" Color             | 18,374.89     |
| Office Supplies   | GBC DocuBind P400 Electric Binding System                             | 17,965.07     |
| Office Supplies   | High Speed Automatic Electric Letter Opener                           | 17,030.31     |

```sql
-- What percentage of total sales is contributed by each region?
WITH region_sales AS (
    SELECT l.region, SUM(s.sales) AS total_sales
    FROM sales s
    JOIN locations l ON s.postal_code = l.postal_code
    GROUP BY l.region
)
SELECT region, total_sales,
       ROUND((total_sales * 100.0 / SUM(total_sales) OVER ()), 2) AS percentage_of_total_sales
FROM region_sales
ORDER BY percentage_of_total_sales DESC;
```

| Region   | Total Sales   | Percentage of Total Sales (%) |
|----------|---------------|-------------------------------|
| West     | 710,219.68    | 31.40                        |
| East     | 669,518.73    | 29.60                        |
| Central  | 492,646.91    | 21.78                        |
| South    | 389,151.46    | 17.21                        |
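
The window expression `SUM(total_sales) OVER ()` returns the grand total on every row, so each percentage is simply a region's total divided by the sum of all regions. The sketch below reproduces that arithmetic in Python from the totals in the table above; the names are illustrative.

```python
# Regional totals copied from the table above.
REGION_SALES = {
    "West": 710219.68,
    "East": 669518.73,
    "Central": 492646.91,
    "South": 389151.46,
}


def percentage_of_total(totals):
    """Mirror total_sales * 100.0 / SUM(total_sales) OVER (), rounded to 2 places."""
    grand_total = sum(totals.values())
    return {region: round(value * 100.0 / grand_total, 2)
            for region, value in totals.items()}


shares = percentage_of_total(REGION_SALES)
print(shares)
```

The rounded shares add up to 99.99 rather than 100 because each one is rounded independently.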

## Summary of Findings

The exploratory data analysis of the `superstore_sales` dataset revealed key insights into the business's performance and dynamics:

### 1. Regional Sales Distribution
- The **West** region contributed the most to total sales (**31.40%**), followed by the **East** (**29.60%**).
- The **Central** and **South** regions accounted for **21.78%** and **17.21%**, respectively.

### 2. Top Product Sub-Categories
- **Phones** generated the highest revenue (**$327,782.45**), followed by **Chairs** (**$322,822.73**).
- Other top sub-categories include **Storage**, **Tables**, and **Binders**, so the top five span all three product categories: Technology (Phones), Furniture (Chairs, Tables), and Office Supplies (Storage, Binders).

### 3. Top Customers
- The highest-spending customer was **Sean Miller**, with total sales of **$25,043.05**.
- **Tamara Chand** and **Raymond Buch** followed, contributing **$19,052.22** and **$15,117.34**, respectively.

### 4. Sales Trends Over Time
- Sales show seasonal peaks: **September**, **November**, and **December** are the three strongest months in every year.
- **2018** was the best-performing year, with **November 2018** achieving the highest monthly revenue of **$117,938.16**.

### 5. Shipping Performance
- **Standard Class** was the most utilized shipping mode (**2,945 orders**) but had the longest average delivery time (**5.01 days**).
- **Same Day** shipping, though less common (**261 orders**), had the fastest average delivery time (**0.04 days**).

### 6. City-Level Analysis
- **New York City** generated the most sales (**$252,462.55**), followed by **Los Angeles** (**$173,420.18**) and **Seattle** (**$116,106.32**).
- The top 10 cities account for about **44.5%** of total sales (**$1,006,921.32** of **$2,261,536.78**), emphasizing the importance of urban markets.

### 7. Segment and Category Insights
- The **Consumer** segment outperformed the **Corporate** and **Home Office** segments in total sales in every year from 2015 to 2018.
- Among individual products, the **Technology** item "Canon imageCLASS 2200 Advanced Copier" was the best seller at **$61,599.82**, while **Office Supplies** products took five of the top 10 places.