# Project Dataset Overview

This notebook provides an overview and initial exploration of the data warehouse used for business intelligence and analytics. The dataset is structured to support analysis of sales, customers, products, and related business processes. It is organized in a star-schema format, commonly used in data warehousing for efficient querying and reporting.


**Purpose:**
- Understand the structure and contents of the dataset before performing any analysis.
- Identify key tables, their relationships, and the types of information available.
- Prepare for further data profiling, cleaning, and modeling steps.


**Source:**
- The data is provided as CSV files in the `data/raw/` directory, simulating a typical data warehouse export.
- Tables are split into *fact* and *dimension* tables following standard data warehousing conventions.


**Business Context:**
- The dataset supports business questions related to sales performance, customer demographics, and product analysis; the fact tables present also cover finance, call center activity, currency rates, and sales targets.
- It is suitable for exercises in data discovery, ETL, and analytics.

### Data Discovery & Warehouse Understanding
The goal of this stage is to understand the structure of the dataset so that later analysis decisions are informed.

In [1]:
# os is used to interact with the operating system (e.g., listing files)
import os

# pandas is used to load and inspect tabular data (CSV, Excel, etc.)
import pandas as pd

#### Handling Data Directory Paths

Depending on how the notebook is launched, the working directory may differ. If run from the `notebooks/` folder, the relative path to the data is `../data/raw`. If run from the project root, it is `data/raw`. To make the code work in both scenarios, we check both candidate paths and use the first one that exists; if neither exists, the cell raises a `FileNotFoundError` that lists the paths it tried.

In [2]:
# Try both possible data directory paths
possible_dirs = ["../data/raw", "data/raw"]
DATA_DIR = None

# Find the first existing data directory
for d in possible_dirs:
    # Check if the directory exists
    if os.path.isdir(d):
        DATA_DIR = d
        break

# Raise an error if no data directory is found
if DATA_DIR is None:
    raise FileNotFoundError(f"Could not find the data/raw directory. Checked: {possible_dirs}")

# List all files inside the data directory
# (os.listdir returns entries in arbitrary order, so the listing may differ between machines)
files = os.listdir(DATA_DIR)

# Display the list of files
files

['FactInternetSales.csv',
 'DimSalesReason.csv',
 'FactCallCenter.csv',
 'DimPromotion.csv',
 'DimProduct.csv',
 'FactCurrencyRate.csv',
 'DimProductCategory.csv',
 'DimProductSubcategory.csv',
 'DimGeography.csv',
 'DimDate.csv',
 'FactFinance.csv',
 'DimCustomer.csv',
 'FactSalesTargets.csv',
 'DimReseller.csv',
 'DimOrganization.csv',
 'DimDepartmentGroup.csv',
 'DimAccount.csv',
 'DimScenario.csv',
 'DimCurrency.csv',
 'DimDate.xlsx',
 'DimSalesTerritory.csv']

In [4]:
# Identify dimension tables (Dim*.csv)
dim_files = [f for f in files if f.startswith("Dim") and f.endswith(".csv")]

# Identify fact tables (Fact*.csv)
fact_files = [f for f in files if f.startswith("Fact") and f.endswith(".csv")]

# Identify Excel files (e.g., DimDate.xlsx)
xlsx_files = [f for f in files if f.endswith(".xlsx")]

# Display categorized files
dim_files, fact_files, xlsx_files

(['DimSalesReason.csv',
  'DimPromotion.csv',
  'DimProduct.csv',
  'DimProductCategory.csv',
  'DimProductSubcategory.csv',
  'DimGeography.csv',
  'DimDate.csv',
  'DimCustomer.csv',
  'DimReseller.csv',
  'DimOrganization.csv',
  'DimDepartmentGroup.csv',
  'DimAccount.csv',
  'DimScenario.csv',
  'DimCurrency.csv',
  'DimSalesTerritory.csv'],
 ['FactInternetSales.csv',
  'FactCallCenter.csv',
  'FactCurrencyRate.csv',
  'FactFinance.csv',
  'FactSalesTargets.csv'],
 ['DimDate.xlsx'])

### Directory and File Structure

The data files are organized in the `data/raw/` directory. The naming convention follows standard data warehousing practices:

- **Dimension tables** (`Dim*.csv`): Contain descriptive attributes related to business entities (e.g., customers, products, dates).
- **Fact tables** (`Fact*.csv`): Store quantitative data about business processes (e.g., sales, finance, call center activity).
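- **Excel files** (`*.xlsx`): The only one here is `DimDate.xlsx`, which shares its table name with `DimDate.csv`. Whether the two hold the same data has not been checked at this point.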

This structure helps separate core business events (facts) from descriptive context (dimensions), making the data easier to analyze and maintain.
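
As a small illustration of the naming convention, the sketch below classifies file names by prefix and flags tables that appear in more than one format. It uses only the standard library and an abbreviated copy of the listing shown earlier; the helper `classify` and the variable `formats_by_table` are hypothetical names introduced for this example. Unlike the cell above, it classifies by prefix regardless of extension.

```python
from collections import defaultdict
from pathlib import PurePath

# Abbreviated copy of the directory listing shown earlier in this notebook.
files = [
    "FactInternetSales.csv",
    "DimProduct.csv",
    "DimDate.csv",
    "DimCustomer.csv",
    "DimDate.xlsx",
]


def classify(name: str) -> str:
    """Return 'dimension', 'fact', or 'other' based on the file-name prefix."""
    stem = PurePath(name).stem
    if stem.startswith("Dim"):
        return "dimension"
    if stem.startswith("Fact"):
        return "fact"
    return "other"


# Group file extensions by table name so tables stored in several formats stand out.
formats_by_table = defaultdict(list)
for name in files:
    path = PurePath(name)
    formats_by_table[path.stem].append(path.suffix)

duplicates = {table: exts for table, exts in formats_by_table.items() if len(exts) > 1}
print(duplicates)  # {'DimDate': ['.csv', '.xlsx']}
```

A table that shows up under two extensions is worth a quick comparison before deciding which copy to load.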

### Star Schema Design

The dataset follows a **star-schema-based** dimensional data warehouse design, consisting of multiple fact tables representing business processes and dimension tables providing descriptive context.

A **star schema** is a common data warehouse modeling approach where a central fact table is connected to multiple dimension tables. This design enables efficient querying and reporting by separating measurable business events (facts) from descriptive attributes (dimensions).

- **Fact tables** contain keys to dimension tables and numeric measures (e.g., sales amount, quantity).
- **Dimension tables** provide context (e.g., product details, customer demographics, dates).

This structure simplifies complex queries and supports flexible analytics. In this dataset, you will see fact tables (e.g., sales, finance) linked to various dimensions (e.g., product, customer, date).
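
To make the fact-to-dimension link concrete, here is a minimal sketch of a star-schema join on toy data. `ProductKey` and `SalesAmount` are real column names from `FactInternetSales.csv`; the dimension frame and its `ProductLabel` column are assumptions made up for illustration and do not reflect the actual contents of `DimProduct.csv`.

```python
import pandas as pd

# Toy fact rows: one foreign key and one measure.
fact = pd.DataFrame(
    {
        "ProductKey": [310, 346, 346],
        "SalesAmount": [3578.27, 3399.99, 3399.99],
    }
)

# Toy dimension: ProductLabel is a hypothetical attribute, not a real column.
dim_product = pd.DataFrame(
    {
        "ProductKey": [310, 346],
        "ProductLabel": ["Product A", "Product B"],
    }
)

# validate="many_to_one" makes pandas raise if ProductKey repeats in the dimension.
joined = fact.merge(dim_product, on="ProductKey", how="left", validate="many_to_one")

# Aggregate the measure by the descriptive attribute.
sales_by_product = joined.groupby("ProductLabel")["SalesAmount"].sum()
print(sales_by_product)
```

A left join keeps every fact row even when a dimension match is missing, which makes unmatched keys visible as nulls instead of silently dropping sales.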

In [5]:
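# Load one fact table to inspect its structure.
# The file is named explicitly: fact_files[0] depends on os.listdir order,
# which is not guaranteed to be the same on every machine.
fact_sample = pd.read_csv(os.path.join(DATA_DIR, "FactInternetSales.csv"))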

# Display first few rows
fact_sample.head(5)

| | ProductKey | OrderDateKey | DueDateKey | ShipDateKey | CustomerKey | PromotionKey | CurrencyKey | SalesTerritoryKey | SalesOrderNumber | SalesOrderLineNumber | ... | ProductStandardCost | TotalProductCost | SalesAmount | TaxAmt | Freight | CarrierTrackingNumber | CustomerPONumber | OrderDate | DueDate | ShipDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 310 | 20101229 | 20110110 | 20110105 | 21768 | 1 | 19 | 6 | SO43697 | 1 | ... | 2171.2942 | 2171.2942 | 3578.27 | 286.2616 | 89.4568 | | | 2010-12-29 00:00:00.000 | 2011-01-10 00:00:00.000 | 2011-01-05 00:00:00.000 |
| 1 | 346 | 20101229 | 20110110 | 20110105 | 28389 | 1 | 39 | 7 | SO43698 | 1 | ... | 1912.1544 | 1912.1544 | 3399.99 | 271.9992 | 84.9998 | | | 2010-12-29 00:00:00.000 | 2011-01-10 00:00:00.000 | 2011-01-05 00:00:00.000 |
| 2 | 346 | 20101229 | 20110110 | 20110105 | 25863 | 1 | 100 | 1 | SO43699 | 1 | ... | 1912.1544 | 1912.1544 | 3399.99 | 271.9992 | 84.9998 | | | 2010-12-29 00:00:00.000 | 2011-01-10 00:00:00.000 | 2011-01-05 00:00:00.000 |
| 3 | 336 | 20101229 | 20110110 | 20110105 | 14501 | 1 | 100 | 4 | SO43700 | 1 | ... | 413.1463 | 413.1463 | 699.0982 | 55.9279 | 17.4775 | | | 2010-12-29 00:00:00.000 | 2011-01-10 00:00:00.000 | 2011-01-05 00:00:00.000 |
| 4 | 346 | 20101229 | 20110110 | 20110105 | 11003 | 1 | 6 | 9 | SO43701 | 1 | ... | 1912.1544 | 1912.1544 | 3399.99 | 271.9992 | 84.9998 | | | 2010-12-29 00:00:00.000 | 2011-01-10 00:00:00.000 | 2011-01-05 00:00:00.000 |


### Data Loading and Inspection

To understand the structure and quality of the data, we load sample tables and inspect their contents. This includes:
- Viewing the first few rows to get a sense of the data values.
- Checking the shape (number of rows and columns) to estimate table size.
- Inspecting column names and data types to identify keys, measures, and attributes.

These steps help identify potential data quality issues and inform further analysis or cleaning.
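
The inspection steps listed above can be bundled into one helper, sketched below on a toy frame. The function name `summarize_table` and the rule "columns ending in `Key` are keys" are assumptions for illustration, not something the dataset documents; treat the result as a starting point and not as a definitive classification.

```python
import pandas as pd


def summarize_table(df: pd.DataFrame) -> dict:
    """Collect shape, candidate key columns, numeric non-key columns, and null counts."""
    # Heuristic (assumption): columns whose name ends in "Key" are key columns.
    key_cols = [c for c in df.columns if c.endswith("Key")]
    numeric_cols = df.select_dtypes(include="number").columns
    measure_cols = [c for c in numeric_cols if c not in key_cols]
    null_counts = df.isna().sum()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "key_columns": key_cols,
        "candidate_measures": measure_cols,
        "columns_with_nulls": null_counts[null_counts > 0].to_dict(),
    }


# Toy frame using a few column names from FactInternetSales.csv.
toy = pd.DataFrame(
    {
        "ProductKey": [310, 346],
        "CustomerKey": [21768, 28389],
        "SalesOrderNumber": ["SO43697", "SO43698"],
        "SalesAmount": [3578.27, 3399.99],
        "CustomerPONumber": [None, None],
    }
)

summary = summarize_table(toy)
print(summary)
```

On the real table, the numeric heuristic would also flag columns such as `SalesOrderLineNumber` and `RevisionNumber` as candidate measures, so the output still needs a manual review.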

In [6]:
# Check number of rows and columns
fact_sample.shape

(60398, 26)

In [7]:
# Inspect column names and data types
fact_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60398 entries, 0 to 60397
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ProductKey             60398 non-null  int64  
 1   OrderDateKey           60398 non-null  int64  
 2   DueDateKey             60398 non-null  int64  
 3   ShipDateKey            60398 non-null  int64  
 4   CustomerKey            60398 non-null  int64  
 5   PromotionKey           60398 non-null  int64  
 6   CurrencyKey            60398 non-null  int64  
 7   SalesTerritoryKey      60398 non-null  int64  
 8   SalesOrderNumber       60398 non-null  object 
 9   SalesOrderLineNumber   60398 non-null  int64  
 10  RevisionNumber         60398 non-null  int64  
 11  OrderQuantity          60398 non-null  int64  
 12  UnitPrice              60398 non-null  float64
 13  ExtendedAmount         60398 non-null  float64
 14  UnitPriceDiscountPct   60398 non-null  int64  
 15  Di

### Glossary of Terms

- **Fact Table:** Contains quantitative data about business processes (e.g., sales, revenue) and foreign keys to dimension tables.
- **Dimension Table:** Provides descriptive context for facts (e.g., product details, customer info, dates).
- **Primary Key:** A unique identifier for each record in a table.
- **Foreign Key:** A field in one table that links to the primary key of another table, establishing relationships between tables.
- **Star Schema:** A data warehouse design with a central fact table connected to multiple dimension tables, resembling a star shape.
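
The primary-key and foreign-key definitions above translate into two simple checks, sketched here on toy data. The helper names `is_primary_key` and `orphan_keys` are hypothetical, and the key value `99999` is invented to show what an unmatched foreign key looks like; it does not come from the dataset.

```python
import pandas as pd


def is_primary_key(df: pd.DataFrame, column: str) -> bool:
    """True if the column has no nulls and no repeated values."""
    return bool(df[column].notna().all() and df[column].is_unique)


def orphan_keys(fact: pd.DataFrame, dim: pd.DataFrame, key: str) -> list:
    """Return the sorted fact-side key values that have no match in the dimension."""
    missing = ~fact[key].isin(dim[key])
    return sorted(fact.loc[missing, key].unique().tolist())


# Toy dimension and fact frames; 99999 is a made-up orphan value.
dim_customer = pd.DataFrame({"CustomerKey": [11003, 14501, 21768]})
fact_sales = pd.DataFrame({"CustomerKey": [21768, 11003, 99999, 21768]})

print(is_primary_key(dim_customer, "CustomerKey"))  # True
print(is_primary_key(fact_sales, "CustomerKey"))  # False
print(orphan_keys(fact_sales, dim_customer, "CustomerKey"))  # [99999]
```

A key column is expected to be unique in its dimension table and to repeat freely in a fact table, so the second check returning `False` is normal and not a data quality problem.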

## Summary of Dataset Structure

The dataset follows a dimensional (star schema) design consisting of:
- Multiple **fact tables** representing transactional and quantitative data
- Multiple **dimension tables** providing descriptive context

This understanding informs the schema design and join strategy used in the next stage of the analysis.
