### Task 4 - Fact Table Creation

Objective: Create an analytics-ready fact table.

- Fact Table Name: fact_sales
- Fact Table Grain: one record per order item


In [0]:
%sql
use catalog main;
use schema ecommerce;


**Is it necessary to specify the header option when reading a Parquet file?**

No, it is not necessary to specify `option("header", "true")` when reading a Parquet file.
CSV files are plain text, so the reader needs a "header" flag to know whether the first row holds column names or data. Parquet is a binary columnar format that stores its schema (column names and data types) in its own internal metadata.

Key Reasons Why
- **Self-Describing Format**: Parquet files store the schema in the file footer. When you read the file back into Spark or Databricks, the engine automatically reads this metadata to reconstruct the DataFrame with the correct column names and types.
- **Automatic Schema Preservation**: Spark SQL preserves the original data schema during both the write and read processes for Parquet files.
- **Efficiency**: Because the schema is built in, neither `inferSchema` nor `header` is needed. This avoids the extra pass over the data that CSV schema inference requires and removes a common source of mistyped columns.
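To see why CSV needs the flag in the first place, here is a minimal sketch using only Python's standard `csv` module rather than Spark. The two-line payload and its column names are made up for illustration.

```python
import csv
import io

# Hypothetical CSV payload; the column names and values are illustrative only.
raw = "order_id,payment_value\nA1,10.5\n"

# Without a header hint, the first line is treated as an ordinary data row.
rows_no_header = list(csv.reader(io.StringIO(raw)))

# With a header hint (DictReader), the first line supplies the column names.
rows_with_header = list(csv.DictReader(io.StringIO(raw)))

print(rows_no_header)
print(rows_with_header)
```

In both cases every value comes back as a string, because plain text carries no type information. Parquet avoids both ambiguities by storing names and types alongside the data.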

In [0]:
# Parquet carries its own schema, so no header option is needed.
df_order_payments_src = spark.read.parquet("/Volumes/main/ecommerce/lakehouse_volumes/silver_dataset/order_payments/")

In [0]:
%sql
-- Collapse payments to one row per order before joining to order items.
create or replace table order_sum_of_payments
as
select order_id,
       max(payment_sequential) as no_of_months,
       sum(payment_value) as total_payment_value
from order_payments
group by order_id;


In [0]:
%sql
select * from order_sum_of_payments order by 2 desc;
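The pre-aggregation matters for the fact table grain: an order can have several payment rows, and joining them directly to order items would repeat each item once per payment. The sketch below illustrates this with Python's built-in `sqlite3` standing in for Spark SQL, so it runs without a cluster. The tables are reduced to a few columns and the values are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table order_items (order_id text, order_item_id integer, price real);
    create table order_payments (order_id text, payment_sequential integer, payment_value real);
    -- One order with two items, paid in three payments (made-up values).
    insert into order_items values ('A1', 1, 50.0), ('A1', 2, 30.0);
    insert into order_payments values ('A1', 1, 40.0), ('A1', 2, 20.0), ('A1', 3, 20.0);
""")

# Joining the raw payments repeats every item once per payment row.
raw_rows = con.execute("""
    select count(*)
    from order_items oi
    join order_payments op on op.order_id = oi.order_id
""").fetchone()[0]

# Aggregating first leaves exactly one payment row per order.
agg_rows = con.execute("""
    select count(*)
    from order_items oi
    join (select order_id,
                 max(payment_sequential) as no_of_months,
                 sum(payment_value)      as total_payment_value
          from order_payments
          group by order_id) op on op.order_id = oi.order_id
""").fetchone()[0]

print(raw_rows, agg_rows)  # 6 2
```

One consequence worth keeping in mind: the aggregated payment total is an order-level value, so it appears on every item row of that order. Summing it across item rows would overcount multi-item orders.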

In [0]:
%sql

create or replace table fact_sales
as
select distinct 
      o.order_id,
      oi.order_item_id,
      c.customer_id,
      oi.product_id,
      o.order_purchase_timestamp,
      to_date(o.order_purchase_timestamp,'dd-MM-yyyy') order_date,
      oi.price,
      oi.freight_value,
      (oi.price + oi.freight_value) as revenue,
      c.customer_state,
      c.customer_city,
      UPPER(p.product_category_name)  as product_category_name,
      op.total_payment_value as payment_value,
      op.no_of_months as no_of_months,
      UPPER(o.order_status) as order_status,
      o.LAOD_DATE as load_date
from orders o
join order_items oi
  on o.order_id = oi.order_id
join customer c
  on c.customer_id = o.customer_id
join order_sum_of_payments op
  on op.order_id = o.order_id
join products p
  on p.product_id = oi.product_id;




In [0]:
%sql
select count(1) from fact_sales

In [0]:
df_fact_sales = spark.sql("select * from fact_sales")

In [0]:
# If the grain holds, this matches the count(1) result from fact_sales above.
df_fact_sales.dropDuplicates(["order_id","order_item_id"]).count()


In [0]:
%sql
-- analysis
select s.customer_id,
       s.order_id,
       s.order_date,
       s.order_item_id,
       s.payment_value,
       s.revenue,
       s.order_purchase_timestamp,
       s.product_category_name
from fact_sales s
order by s.order_id
limit 10;

In [0]:
%sql
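-- Grain check: any (order_id, order_item_id) pair returned here is a duplicate; expect zero rows.
select s.order_id,
       s.order_item_id,
       count(1) as row_count
from fact_sales s
group by s.order_id, s.order_item_id
having count(1) > 1
order by row_count desc;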
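The two checks above (total rows versus distinct keys, and the grouped duplicate query) can be wrapped into a reusable assertion. The following is a minimal sketch, again using `sqlite3` as a stand-in for Spark SQL. The helper name `grain_violations` and the sample rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table fact_sales (order_id text, order_item_id integer, revenue real);
    -- Made-up sample rows: two items for order A1, one item for order B7.
    insert into fact_sales values ('A1', 1, 55.0), ('A1', 2, 33.0), ('B7', 1, 12.0);
""")

def grain_violations(con, table, keys):
    """Return key combinations that occur more than once in `table`."""
    cols = ", ".join(keys)
    sql = f"select {cols}, count(*) from {table} group by {cols} having count(*) > 1"
    return con.execute(sql).fetchall()

# Check 1: the total row count should equal the number of distinct grain keys.
total_rows = con.execute("select count(*) from fact_sales").fetchone()[0]
distinct_keys = con.execute(
    "select count(*) from (select distinct order_id, order_item_id from fact_sales)"
).fetchone()[0]

# Check 2: no key combination should appear more than once.
violations = grain_violations(con, "fact_sales", ["order_id", "order_item_id"])

print(total_rows, distinct_keys, violations)  # 3 3 []
```

If a duplicate `('A1', 1)` row were inserted, `grain_violations` would return `[('A1', 1, 2)]`, which identifies the offending key and how many times it occurs.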

In [0]:
%sql

select * from fact_sales s order by no_of_months desc limit 10;