# Schema Understanding and Table Joins

This notebook focuses on constructing an analytics-ready dataset by joining fact and dimension tables from the data warehouse.

As introduced in the previous notebook, the dataset follows a star schema, where:
- Fact tables store measurable business events
- Dimension tables provide descriptive attributes

LEFT JOINs are used to preserve all fact records and avoid data loss.

In [None]:
# Standard library for file system operations
import os

# Pandas for data loading and table joins
import pandas as pd

### Defining a Robust Data Loading Function

To simplify and standardise data loading, a small helper function `load_csv()` is defined.

The function:
- Accepts the data directory and file name
- Allows optional specification of the file encoding (default: `latin1`)

This reduces repetition and keeps data loading consistent across the notebook.

In [None]:
def load_csv(data_dir, filename, encoding="latin1"):
    """Read a CSV file from data_dir into a DataFrame using the given encoding."""
    return pd.read_csv(os.path.join(data_dir, filename), encoding=encoding)

In [None]:
# Try both possible data directory paths
possible_dirs = ["../data/raw", "data/raw"]
DATA_DIR = None

# Find the first existing data directory
for d in possible_dirs:
    # Check if the directory exists
    if os.path.isdir(d):
        DATA_DIR = d
        break

# Raise an error if no data directory is found
if DATA_DIR is None:
    raise FileNotFoundError("Could not find the data/raw directory. Checked: {}".format(possible_dirs))

# Confirm the resolved data directory
DATA_DIR

### Loading tables

The main fact table and selected dimension tables are loaded using the load_csv() function defined earlier.

In [None]:
# Load the main fact table; utf-8-sig strips the BOM at the start of this file
fact_sales = load_csv(DATA_DIR, "FactInternetSales.csv", encoding="utf-8-sig")

# Load key dimension tables
dim_product = load_csv(DATA_DIR, "DimProduct.csv")
dim_customer = load_csv(DATA_DIR, "DimCustomer.csv")
dim_date = load_csv(DATA_DIR, "DimDate.csv")

# Display shapes to understand join sizes
fact_sales.shape, dim_product.shape, dim_customer.shape, dim_date.shape

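The fact table is significantly larger (in row count) than the dimension tables. This is typical of a star schema: each fact row records one business event, while each dimension row describes an entity, such as a product or customer, that many events share.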

##### Why specify encoding when loading CSVs?

Some CSV files in this dataset contain characters that are not valid UTF-8, which can cause decoding errors under pandas’ default `utf-8` encoding.

To avoid this, `encoding="latin1"` is explicitly specified when loading the affected files. For `FactInternetSales.csv`, `utf-8-sig` is used instead so that the Byte Order Mark (BOM) at the start of the file is stripped.

### Inspecting Column Names

In [None]:
# Inspect column names in fact and product dimension tables
print("FactInternetSales columns:")
print(fact_sales.columns.tolist())

print("\nDimProduct columns:")
print(dim_product.columns.tolist())

Inspecting the column names is how the encoding issue in the fact table shows up.

When `FactInternetSales.csv` is read with `encoding="latin1"`, its first column is named 'ï»¿ProductKey' instead of 'ProductKey'. This happens because the file starts with a UTF-8 Byte Order Mark (BOM), and `latin1` decodes the three BOM bytes as the characters 'ï»¿'.

Reading `FactInternetSales.csv` with `encoding="utf-8-sig"`, as done in the loading cell above, strips the BOM and yields the correct column name.

The other tables do not need `encoding="utf-8-sig"` unless they show the same BOM issue. If they load correctly and their column names look normal, `encoding="latin1"` (the `load_csv()` default) can stay.
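
The effect of the two encodings can be reproduced without the real file. The following is a minimal sketch using only the standard library and a made-up two-column CSV (the values and the `header` helper are hypothetical); the codec names are the same ones passed to the `encoding` argument when loading the tables.

```python
import csv
import io

# Hypothetical CSV content, encoded as UTF-8 with a leading BOM
raw = "ProductKey,SalesAmount\n310,3578.27\n".encode("utf-8-sig")


def header(raw_bytes, encoding):
    """Decode the bytes with the given codec and return the CSV header row."""
    text = raw_bytes.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


latin1_header = header(raw, "latin1")        # BOM bytes survive as 'ï»¿'
utf8_sig_header = header(raw, "utf-8-sig")   # BOM is stripped

print(latin1_header)
print(utf8_sig_header)
```

With `latin1`, the first header cell keeps the three stray characters; with `utf-8-sig`, it is the plain column name.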



In [None]:
# Standardize column names by stripping whitespace
fact_sales.columns = fact_sales.columns.str.strip()
dim_product.columns = dim_product.columns.str.strip()
dim_customer.columns = dim_customer.columns.str.strip()
dim_date.columns = dim_date.columns.str.strip()

## Joining FactInternetSales with Dimension Tables

The fact table is incrementally joined with relevant dimension tables to enrich transactional records with descriptive attributes.

LEFT JOINs ensure that all sales transactions are retained even if corresponding dimension records are missing.
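
To make the difference concrete, here is a small sketch on toy data (the frames and values are made up for illustration and are not taken from the warehouse). It compares a left join with an inner join and uses the `indicator` option of `merge` to list fact rows that have no matching dimension record.

```python
import pandas as pd

# Toy fact table: ProductKey 99 has no matching dimension row
sales = pd.DataFrame({
    "ProductKey": [1, 2, 99],
    "SalesAmount": [10.0, 20.0, 30.0],
})

# Toy dimension table
products = pd.DataFrame({
    "ProductKey": [1, 2],
    "EnglishProductName": ["Helmet", "Gloves"],
})

# Left join keeps every sales row; indicator=True adds a "_merge" column
left = sales.merge(products, on="ProductKey", how="left", indicator=True)

# Inner join drops sales rows without a matching product
inner = sales.merge(products, on="ProductKey", how="inner")

# Fact rows whose key was not found in the dimension table
unmatched = left[left["_merge"] == "left_only"]

print(len(left), len(inner), len(unmatched))
```

The unmatched rows keep their fact columns, and the dimension attributes are filled with missing values.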


In [None]:
# Join FactInternetSales with DimProduct to add product attributes
fact_with_product = fact_sales.merge(
    dim_product,
    on="ProductKey",
    how="left"
)

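# Validate join result: a left join on a unique dimension key must not change the row count
assert fact_with_product.shape[0] == fact_sales.shape[0], \
    "Row count mismatch after joining DimProduct"

fact_with_product.shape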
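
A left join keeps every fact row, but it can also add rows if the dimension key is not unique. One way to guard against this is the `validate` argument of `merge`. The sketch below uses toy data with a deliberately duplicated key (hypothetical values, not the real DimProduct) to show both the silent row growth and the check.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy fact table with three sales rows
sales = pd.DataFrame({
    "ProductKey": [1, 1, 2],
    "SalesAmount": [10.0, 15.0, 20.0],
})

# Toy dimension table where ProductKey 1 appears twice (a data quality problem)
products_dup = pd.DataFrame({
    "ProductKey": [1, 1, 2],
    "EnglishProductName": ["Helmet", "Helmet v2", "Gloves"],
})

# Without a check, each sale of product 1 matches two dimension rows
unchecked = sales.merge(products_dup, on="ProductKey", how="left")

# validate="many_to_one" requires the key to be unique in the right table
try:
    sales.merge(products_dup, on="ProductKey", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

print(len(sales), len(unchecked), duplicate_keys_detected)
```

Comparing row counts before and after the join catches the same problem after the fact; `validate` stops the join before it produces the inflated result.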

### Inspect new columns added by DimProduct

In [None]:
# Show columns coming from DimProduct
product_columns = [col for col in fact_with_product.columns if col not in fact_sales.columns]
product_columns
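
The comparison above works by column name, so it is worth knowing what `merge` does when both tables share a non-key column name: it renames both sides with suffixes (`_x` and `_y` by default). A minimal sketch on toy data, where the shared `Status` column is a made-up example:

```python
import pandas as pd

# Toy tables that share the non-key column name "Status"
fact = pd.DataFrame({"ProductKey": [1, 2], "Status": ["Shipped", "Open"]})
dim = pd.DataFrame({"ProductKey": [1, 2], "Status": ["Current", "Current"]})

# Default behaviour: both overlapping columns are renamed
default = fact.merge(dim, on="ProductKey", how="left")

# Explicit suffixes: keep the fact column name, label the dimension column
named = fact.merge(dim, on="ProductKey", how="left", suffixes=("", "_product"))

print(default.columns.tolist())
print(named.columns.tolist())
```

Explicit suffixes make it easier to tell which table a column came from once several dimensions have been joined.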


In [None]:
# Preview selected business-relevant columns
fact_with_product[[
    "SalesOrderNumber",
    "ProductKey",
    "EnglishProductName",
    "Color",
    "SalesAmount",
    "OrderQuantity"
]].head()


In [None]:
# Check the size of the customer dimension before joining
dim_customer.shape

In [None]:
# Join with customer dimension
fact_with_product_customer = fact_with_product.merge(
    dim_customer,
    on="CustomerKey",
    how="left"
)

# Validate join
assert fact_with_product_customer.shape[0] == fact_with_product.shape[0], \
    "Row count mismatch after joining DimCustomer"

fact_with_product_customer.shape
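
`dim_date` is loaded earlier but is not joined in this notebook. If date attributes are needed, one possible approach is a left join on the order date key. The sketch below runs on toy data and assumes the fact table has an `OrderDateKey` column and the date dimension has `DateKey`, `CalendarYear`, and `EnglishMonthName` columns; these names are assumptions and should be checked against the actual column lists before adapting it.

```python
import pandas as pd

# Toy stand-in for the joined fact table (column names are assumptions)
fact = pd.DataFrame({
    "SalesOrderNumber": ["SO1", "SO2"],
    "OrderDateKey": [20130101, 20130102],
    "SalesAmount": [10.0, 20.0],
})

# Toy stand-in for DimDate (column names are assumptions)
dates = pd.DataFrame({
    "DateKey": [20130101, 20130102],
    "CalendarYear": [2013, 2013],
    "EnglishMonthName": ["January", "January"],
})

# The key columns have different names, so use left_on / right_on
fact_with_date = fact.merge(
    dates,
    left_on="OrderDateKey",
    right_on="DateKey",
    how="left",
    validate="many_to_one",
)

# Same row-count check as used for the customer join
assert fact_with_date.shape[0] == fact.shape[0], \
    "Row count mismatch after joining DimDate"

fact_with_date.shape
```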


## Resulting Analytics Dataset

The final joined dataset, `fact_with_product_customer`, is a denormalized, analytics-ready table that combines each sales record with its product and customer attributes. This dataset will be used in subsequent notebooks for:

- Exploratory Data Analysis (EDA)
- Feature engineering
- Machine learning model development
