# Create Parquet tables

## Configuration

Before executing this cell, add your name to the file:
<a href="$./includes/configuration" target="_blank">
includes/configuration</a>

```username = "your_name"```

In [0]:
%run ./includes/configuration

Out[3]: DataFrame[]

Reload the data into a dictionary of DataFrames.

In [0]:
# paths to CSV files
file_paths = [raw_path + northwind_file for northwind_file in northwind_files]

# dictionary of DataFrames
raw_data_dict = {}

# reload data
for table_name, file_path in zip(northwind_tables, file_paths):
  raw_data_dict[table_name] = spark.read.csv(file_path, header = True, inferSchema = True)

## Create Parquet tables

#### Step 1: Clean up
Files at `trusted_path` are removed.

Then the tables are dropped.

This step makes the notebook idempotent: it can be executed multiple times with the same result, without raising errors or saving extra files.
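
The idempotency property can be illustrated outside Spark with a minimal local sketch. The function `rebuild_output`, the temporary folder, and the file name `part-0000.txt` are hypothetical stand-ins for `trusted_path` and its contents; they are not part of this project.

```python
import shutil
import tempfile
from pathlib import Path


def rebuild_output(output_dir: Path) -> list:
    """Remove any previous output, then write it again from scratch."""
    # ignore_errors=True keeps the removal safe when the folder does not exist yet
    shutil.rmtree(output_dir, ignore_errors=True)
    output_dir.mkdir(parents=True)
    (output_dir / "part-0000.txt").write_text("example rows\n")
    return sorted(p.name for p in output_dir.iterdir())


# hypothetical local folder standing in for `trusted_path`
demo_dir = Path(tempfile.mkdtemp()) / "trusted"
first_run = rebuild_output(demo_dir)
second_run = rebuild_output(demo_dir)
print(first_run == second_run)
```

Because the clean-up runs first, the second execution leaves exactly the same files as the first one.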

🚨 **NOTE:** In this sample project, files are saved to the Databricks File System (DBFS). Good practice is to save files to cloud storage; DBFS is used here for demonstration purposes only.

In [0]:
dbutils.fs.rm(trusted_path, recurse=True)

for table_name in northwind_tables_trusted:
  spark.sql(
    f"""
    DROP TABLE IF EXISTS {table_name}
    """
  )

#### Step 2: Transform data
Transformations are defined in the dictionary `northwind_columns`, which holds one tuple per column with the source column name, the target type, and the alias.
<br><br>[Reference to transform with list comprehension](https://stackoverflow.com/questions/70005826/how-to-select-columns-and-cast-column-types-in-a-pyspark-dataframe)
<br>[Reference to create columns `order_year`, `order_month` and `order_day`](https://sparkbyexamples.com/pyspark/pyspark-withcolumn/)
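
To make the tuple unpacking in the list comprehension easier to follow, here is a minimal sketch in plain Python. The dictionary `example_columns`, its column names, and the meaning of the fourth tuple element (assumed here to be a description) are hypothetical; the real definitions live in `includes/configuration`.

```python
# Hypothetical excerpt: the real `northwind_columns` lives in includes/configuration.
# Each tuple is assumed to be (source column, target type, alias, description).
example_columns = {
    "orders": [
        ("OrderID", "int", "order_id", "order identifier"),
        ("OrderDate", "date", "order_date", "date the order was placed"),
    ]
}


def cast_expressions(columns):
    """Build one `CAST(... AS ...) AS ...` string per column tuple."""
    # The fourth element is ignored, as in the notebook's list comprehension
    return [f"CAST({c} AS {t}) AS {a}" for c, t, a, _ in columns]


expressions = cast_expressions(example_columns["orders"])
for expression in expressions:
    print(expression)
```

Strings of this shape could also be passed to `DataFrame.selectExpr`, which accepts SQL expressions, as an alternative to chaining `col(c).cast(t).alias(a)`.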

In [0]:
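# import only the functions used in this notebook instead of a wildcard import
from pyspark.sql.functions import col, dayofmonth, month, year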

trusted_data_dict = {}
for table_name in northwind_columns:
  trusted_data_dict[table_name + '_trusted'] = raw_data_dict[table_name].select([col(c).cast(t).alias(a) for c, t, a, _ in northwind_columns[table_name]])

In [0]:
# create columns 'order_year', 'order_month' and 'order_day'
trusted_data_dict['orders_trusted'] = trusted_data_dict['orders_trusted'].withColumn('order_year', year(col('order_date')).cast('string'))
trusted_data_dict['orders_trusted'] = trusted_data_dict['orders_trusted'].withColumn('order_month', month(col('order_date')).cast('string'))
trusted_data_dict['orders_trusted'] = trusted_data_dict['orders_trusted'].withColumn('order_day', dayofmonth(col('order_date')).cast('string'))

#### Step 3: Save files to `trusted` folder

1. Use `.format("parquet")`
2. Table 'orders_trusted' is partitioned using columns ``order_year``, ``order_month``, ``order_day``
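
As a rough illustration of the folder layout that partitioning produces, the sketch below builds the Hive-style `column=value` path for a single order date. The function `partition_dir` and the example date are hypothetical; the layout assumes Spark's standard `partitionBy` behavior.

```python
from datetime import date


def partition_dir(order_date: date) -> str:
    """Return the Hive-style relative folder for one order date."""
    # Values are unpadded because the columns are integers cast to strings
    return (
        f"order_year={order_date.year}/"
        f"order_month={order_date.month}/"
        f"order_day={order_date.day}"
    )


example_dir = partition_dir(date(1996, 7, 4))  # hypothetical order date
print(example_dir)
```

Each distinct combination of the three columns ends up in its own folder under the table path, so partitioning by day can create many small folders.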

In [0]:
for table_name in trusted_data_dict:
  if table_name == 'orders_trusted':
    # save partitioned table
    (trusted_data_dict[table_name].write
     .mode("overwrite")
     .format("parquet")
     .partitionBy("order_year", "order_month", "order_day")
     .save(trusted_path + table_name))
  else:
    # save unpartitioned table
    (trusted_data_dict[table_name].write
     .mode("overwrite")
     .format("parquet")
     .save(trusted_path + table_name))

#### Step 4: Register tables in metastore
Spark SQL is used to register the tables in the metastore.
Tables are created in Parquet format, with the `trusted` folder as their data source.
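
Step 3 builds each save path as `trusted_path + table_name`, while the `CREATE TABLE` statement in this step uses `{trusted_path}/{table_name}`. Whether the two agree depends on whether `trusted_path` ends with a slash, which is not shown here. The sketch below is one way to keep both consistent; the helper names and the base paths are hypothetical.

```python
def table_location(base_path: str, table_name: str) -> str:
    """Join a base path and a table name with exactly one slash."""
    return base_path.rstrip("/") + "/" + table_name


def create_table_sql(base_path: str, table_name: str) -> str:
    """Render a CREATE TABLE statement that registers a Parquet folder."""
    location = table_location(base_path, table_name)
    return f'CREATE TABLE {table_name} USING PARQUET LOCATION "{location}"'


# hypothetical base paths, with and without a trailing slash
with_slash = create_table_sql("dbfs:/example/trusted/", "orders_trusted")
without_slash = create_table_sql("dbfs:/example/trusted", "orders_trusted")
print(with_slash)
```

Using the same helper for both the write path and the table location would avoid a mismatch between where the files are saved and where the table looks for them.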

In [0]:
for table_name in trusted_data_dict:
  spark.sql(
    f"""
    DROP TABLE IF EXISTS {table_name}
    """
  )

In [0]:
for table_name in trusted_data_dict:
  spark.sql(
    f"""
    CREATE TABLE {table_name}
    USING PARQUET
    LOCATION "{trusted_path}/{table_name}"
    """
  )

#### Step 5: Verify that the Parquet tables are in the data lake
Count the rows in each table.

In [0]:
for table_name in trusted_data_dict:
  northwind_table = spark.read.table(table_name)
  print(table_name + ":", northwind_table.count())

categories_trusted: 8
customer_customer_demo_trusted: 0
customer_demographics_trusted: 0
customers_trusted: 91
employees_trusted: 9
employee_territories_trusted: 49
order_details_trusted: 2155
orders_trusted: 0
products_trusted: 77
region_trusted: 4
shippers_trusted: 6
suppliers_trusted: 29
territories_trusted: 53
us_states_trusted: 51


**Incorrect count for the partitioned table *orders_trusted*: 0 rows are returned.**

#### Step 6: Partition registry

Following good practice, `orders_trusted` was created as a partitioned table. However, partitions written directly to storage are not discovered automatically by Spark SQL; they must be registered in the metastore before queries can see them.

`MSCK REPAIR TABLE` registers the partitions in the Hive metastore. [More information in this link.](https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-repair-table.html)
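
Conceptually, partition recovery scans the table location for `key=value` folders and records each combination it finds. The sketch below simulates that idea on a local temporary folder; it is a simplified illustration and not Spark's implementation. The function `discover_partitions` and the folder contents are hypothetical.

```python
import tempfile
from pathlib import Path


def discover_partitions(table_root: Path) -> list:
    """Collect partition specs from `key=value` folders that contain files."""
    partitions = []
    for data_file in table_root.rglob("*.parquet"):
        relative_parts = data_file.parent.relative_to(table_root).parts
        spec = dict(part.split("=", 1) for part in relative_parts if "=" in part)
        if spec and spec not in partitions:
            partitions.append(spec)
    return sorted(partitions, key=lambda s: sorted(s.items()))


# build a tiny hypothetical table folder with two partitions
root = Path(tempfile.mkdtemp()) / "orders_trusted"
for day in ("4", "5"):
    folder = root / "order_year=1996" / "order_month=7" / f"order_day={day}"
    folder.mkdir(parents=True)
    (folder / "part-0000.parquet").write_bytes(b"")  # empty placeholder file

found = discover_partitions(root)
print(found)
```

Until a comparable scan has populated the metastore, a query on the partitioned table has no partitions to read, which is consistent with the count of 0 reported in Step 5.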

In [0]:
spark.sql("MSCK REPAIR TABLE orders_trusted")

Out[18]: DataFrame[]

#### Step 7: Count table entries

In [0]:
for table_name in trusted_data_dict:
  northwind_table = spark.read.table(table_name)
  print(table_name + ":", northwind_table.count())

categories_trusted: 8
customer_customer_demo_trusted: 0
customer_demographics_trusted: 0
customers_trusted: 91
employees_trusted: 9
employee_territories_trusted: 49
order_details_trusted: 2155
orders_trusted: 830
products_trusted: 77
region_trusted: 4
shippers_trusted: 6
suppliers_trusted: 29
territories_trusted: 53
us_states_trusted: 51


**Table *orders_trusted* now shows the correct count (830 rows).**