# Normalizing and Preparing the Superstore Sales Database
This notebook documents the process of normalizing the `sales_data` table into a structured relational database. It includes the creation of normalized tables, the addition of constraints, and the setup of relationships between tables.

## Running Queries in pgAdmin

The SQL in this notebook is written to be run against the `superstore_sales` database. Before running it, make sure the database is set up by following these steps:

### Setup Instructions:
1. Run the `main.py` script to create and load the initial `sales_data` table into PostgreSQL.
2. Open **pgAdmin** and connect to the `superstore_sales` database using the details below.

### Query Execution:
Once the database setup is complete, run the SQL cells below in **pgAdmin**, in the order they appear.

### Database Details:
- **Database Name**: `superstore_sales`
- **Host**: `localhost`
- **Port**: `5432`
- **Table**: `sales_data`

## Step 0: Verify the `sales_data` Table
Before proceeding with normalization, we ensure the `sales_data` table exists and contains data. This step verifies the initial table created by the Python script is available in the PostgreSQL database.

In [None]:
-- Check if the `sales_data` table exists
SELECT EXISTS (
    SELECT FROM information_schema.tables 
    WHERE table_name = 'sales_data'
);

-- Check the number of rows in `sales_data`
SELECT COUNT(*) AS total_rows FROM sales_data;

-- Preview the first 10 rows of the table
SELECT * FROM sales_data LIMIT 10;
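
The later steps depend on specific column names in `sales_data` (for example `customer_id`, `postal_code`, and `ship_mode`). As an optional extra check, a sketch like the following lists the table's columns and data types. It assumes the table lives in the default `public` schema, which this notebook does not state explicitly.

```sql
-- List the columns of sales_data with their data types.
-- Assumption: the table was created in the 'public' schema.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'sales_data'
ORDER BY ordinal_position;
```

If a column used in the `CREATE TABLE ... AS SELECT` statements is missing from this list, the loading script should be checked before continuing.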

## Step 1: Creating Normalized Tables
We normalize the `sales_data` table by breaking it into six related tables:
1. **`customers`**: Contains customer information.
2. **`locations`**: Stores geographical data.
3. **`products`**: Contains product details.
4. **`sales`**: Links orders, customers, products, and locations with transactional data.
5. **`ship_modes`**: Standardizes shipping methods in a lookup table instead of keeping them only as text attributes in the `sales` table.
6. **`promotions`**: Holds details about promotions, which can be linked to the `sales` table to analyze the impact of discounts, seasonal offers, and customer engagement strategies.

In [None]:
CREATE TABLE customers AS
SELECT DISTINCT customer_id, customer_name, segment
FROM sales_data;

CREATE TABLE locations AS
SELECT DISTINCT postal_code, city, state, region, country
FROM sales_data;

CREATE TABLE products AS
SELECT DISTINCT product_id, product_name, sub_category, category
FROM sales_data;

CREATE TABLE sales AS
SELECT row_id, order_id, order_date, ship_date, ship_mode, customer_id, postal_code, product_id, sales
FROM sales_data;

-- The ship_modes table is created empty, then filled with the distinct ship_mode values found in sales
CREATE TABLE ship_modes (
    ship_mode_id SERIAL PRIMARY KEY,
    ship_mode_name TEXT UNIQUE NOT NULL
);

INSERT INTO ship_modes (ship_mode_name)
SELECT DISTINCT ship_mode FROM sales;

-- The promotions table is created manually and seeded with sample data.
-- Its primary key is declared at table creation instead of being added afterwards.
CREATE TABLE promotions (
    promo_id SERIAL PRIMARY KEY,
    promo_name VARCHAR(100) NOT NULL,
    discount_pct DECIMAL(5,2) CHECK (discount_pct BETWEEN 0 AND 100),
    -- Dates are nullable so the 'No Promotion' placeholder row can leave them empty
    start_date DATE,
    end_date DATE,
    applicable_category VARCHAR(50) NOT NULL
);

INSERT INTO promotions (promo_id, promo_name, discount_pct, start_date, end_date, applicable_category)
VALUES
    (1, 'No Promotion', 0.00, NULL, NULL, 'All Categories'),
    (2, 'Black Friday', 30.00, '2023-11-24', '2023-11-26', 'All Categories'),
    (3, 'Back to School', 15.00, '2023-08-01', '2023-08-31', 'Office Supplies'),
    (4, 'Tech Week', 20.00, '2023-09-15', '2023-09-22', 'Technology');

-- Explicit ids bypass the SERIAL sequence, so advance it past the highest promo_id
SELECT setval(pg_get_serial_sequence('promotions', 'promo_id'), MAX(promo_id)) FROM promotions;
-- The sales table needs two new columns before it can reference ship_modes and promotions
ALTER TABLE sales
ADD COLUMN ship_mode_id INT,
ADD COLUMN promo_id INT;

-- Every existing sale is assigned promo_id = 1 ('No Promotion')
UPDATE sales
SET promo_id = 1;

## Step 2: Removing Duplicates and Adding Primary Key Constraints
Primary keys are added to uniquely identify rows in each remaining table:
- `customer_id` for `customers`
- `postal_code` for `locations`
- `product_id` for `products`
- `row_id` for `sales`

In [None]:
ALTER TABLE customers ADD PRIMARY KEY (customer_id);

-- Before adding primary keys to locations and products, remove duplicate rows so each key value is unique.
-- ctid is PostgreSQL's physical row identifier; the row with the smallest ctid per key is kept.

DELETE FROM locations
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM locations
    GROUP BY postal_code
);

ALTER TABLE locations ADD PRIMARY KEY (postal_code);

DELETE FROM products
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM products
    GROUP BY product_id
);

ALTER TABLE products ADD PRIMARY KEY (product_id);
ALTER TABLE sales ADD PRIMARY KEY (row_id);

## Step 3: Update `sales` Table to Link to `ship_modes`
Populate `sales.ship_mode_id` by matching each row's `ship_mode` text against `ship_mode_name` in `ship_modes`.

In [None]:
UPDATE sales
SET ship_mode_id = (SELECT ship_mode_id 
                    FROM ship_modes 
                    WHERE ship_modes.ship_mode_name = sales.ship_mode);
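
Because the next step drops the `ship_mode` text column, it is worth confirming that the update matched every row first. The queries below are a minimal sketch of such a check. They use only the tables and columns defined above, and they must be run before Step 4, while `sales.ship_mode` still exists.

```sql
-- Count sales rows whose ship_mode text found no match in ship_modes.
-- A result of 0 means every row received a ship_mode_id.
SELECT COUNT(*) AS unmatched_rows
FROM sales
WHERE ship_mode_id IS NULL;

-- Compare the original text with the looked-up name for a few rows
SELECT s.row_id, s.ship_mode, m.ship_mode_name
FROM sales AS s
JOIN ship_modes AS m ON m.ship_mode_id = s.ship_mode_id
LIMIT 10;
```

If `unmatched_rows` is greater than zero, the `ship_mode` column should be kept until the mismatched values are resolved.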

## Step 4: Drop the Old `ship_mode` Column
After ensuring `ship_mode_id` is correctly populated, remove the redundant `ship_mode` text column.

In [None]:
ALTER TABLE sales DROP COLUMN ship_mode;

## Step 5: Adding Foreign Key Constraints
Foreign keys enforce relationships between tables:
- `customer_id` in `sales` references `customers`
- `postal_code` in `sales` references `locations`
- `product_id` in `sales` references `products`
- `ship_mode_id` in `sales` references `ship_modes`
- `promo_id` in `sales` references `promotions`

In [None]:
ALTER TABLE sales
ADD CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers(customer_id);

ALTER TABLE sales
ADD CONSTRAINT fk_location FOREIGN KEY (postal_code) REFERENCES locations(postal_code);

ALTER TABLE sales
ADD CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(product_id);

ALTER TABLE sales 
ADD CONSTRAINT fk_ship_mode FOREIGN KEY (ship_mode_id) REFERENCES ship_modes(ship_mode_id);

ALTER TABLE sales 
ADD CONSTRAINT fk_promo FOREIGN KEY (promo_id) REFERENCES promotions(promo_id);
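
To confirm that the constraints were actually created, the catalog can be queried. The following is a sketch using the standard `information_schema.table_constraints` view. It assumes no other table named `sales` exists in another schema of the same database.

```sql
-- List the primary key and foreign key constraints defined on sales
SELECT constraint_name, constraint_type
FROM information_schema.table_constraints
WHERE table_name = 'sales'
  AND constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY')
ORDER BY constraint_type, constraint_name;
```

The result should include the five foreign keys named above (`fk_customer`, `fk_location`, `fk_product`, `fk_ship_mode`, `fk_promo`) plus the primary key on `row_id`.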

## Step 6: Dropping the Raw Table
Finally, we drop the original `sales_data` table, as it is no longer needed.

In [None]:
DROP TABLE sales_data;
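
With the raw table gone, a flat, report-style view of the data can still be produced by joining the normalized tables. The query below is an illustrative sketch, not part of the original workflow, and the selected columns are just one possible choice.

```sql
-- Rebuild a flat view of the first 10 sales by joining the normalized tables
SELECT s.row_id, s.order_id, s.order_date,
       c.customer_name, c.segment,
       l.city, l.state, l.region,
       p.product_name, p.category,
       m.ship_mode_name,
       pr.promo_name,
       s.sales
FROM sales AS s
JOIN customers  AS c  ON c.customer_id  = s.customer_id
JOIN locations  AS l  ON l.postal_code  = s.postal_code
JOIN products   AS p  ON p.product_id   = s.product_id
JOIN ship_modes AS m  ON m.ship_mode_id = s.ship_mode_id
JOIN promotions AS pr ON pr.promo_id    = s.promo_id
ORDER BY s.row_id
LIMIT 10;
```

Inner joins are used here, so any `sales` row with a NULL key would be left out of the result. Switching to `LEFT JOIN` keeps such rows visible.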

## Final Notes
This notebook documents the full process of database normalization. The resulting schema stores each customer, location, product, shipping mode, and promotion once and enforces the relationships between them with primary and foreign keys, giving a structured relational model for analysis, reporting, and integration with analytical tools.