
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Ingesting Data Lab

Read in CSV files containing product data.

##### Tasks
1. Read with infer schema
2. Read with user-defined schema
3. Read with schema as DDL formatted string
4. Write using Delta format

In [0]:
%run ../Includes/Classroom-Setup

### 1. Read with infer schema
- View the first CSV file using the DBUtils method **`fs.head`** with the filepath provided in the variable **`single_product_csv_file_path`**
- Create **`products_df`** by reading from CSV files located in the filepath provided in the variable **`products_csv_path`**
  - Configure options to use first line as header and infer schema

In [0]:
# TODO
single_product_csv_file_path = f"{datasets_dir}/products/products.csv/part-00000-tid-1663954264736839188-daf30e86-5967-4173-b9ae-d1481d3506db-2367-1-c000.csv"
print(dbutils.fs.head(single_product_csv_file_path))

products_csv_path = f"{datasets_dir}/products/products.csv"
products_df = (spark
               .read
               .option("sep", ",")
               .option("header", True)
               .option("inferSchema", True)
               .csv(products_csv_path))

products_df.printSchema()
display(products_df)

item_id,name,price
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
P_FOAM_S,Standard Foam Pillow,59.0
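
To make the two reader options above more concrete, here is a minimal plain-Python sketch of what using the first line as a header and inferring column types amounts to. It is an illustration only: the `infer_type` helper and its int/double/string rule are assumptions made for this example, not Spark's actual inference logic, and the sample is a few of the rows shown above.

```python
import csv
import io

# A few lines of the products file, copied from the output above.
sample = """item_id,name,price
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
P_FOAM_S,Standard Foam Pillow,59.0
"""

def infer_type(values):
    """Return the first type name that fits every value (toy rule, not Spark's)."""
    for type_name, cast in (("int", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]   # "header": the first line supplies the column names
columns = list(zip(*data))         # transpose the rows into columns
inferred = {name: infer_type(col) for name, col in zip(header, columns)}
print(inferred)
```

Inference has to look at the data before the types are known, which is one reason the next sections supply the schema explicitly instead.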


**1.1: CHECK YOUR WORK**

In [0]:
assert(products_df.count() == 12)
print("All tests pass")

### 2. Read with user-defined schema
Define schema by creating a **`StructType`** with column names and data types

In [0]:
from pyspark.sql.types import StringType, DoubleType, StructType, StructField
# TODO
user_defined_schema = StructType([
    StructField("item_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True)
])

products_df2 = (spark
                .read
                .option("sep", ",")
                .option("header", True)
                .schema(user_defined_schema)
                .csv(products_csv_path))
display(products_df2)

item_id,name,price
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
P_FOAM_S,Standard Foam Pillow,59.0


**2.1: CHECK YOUR WORK**

In [0]:
assert(user_defined_schema.fieldNames() == ["item_id", "name", "price"])
print("All tests pass")

In [0]:
from pyspark.sql import Row

expected1 = Row(item_id="M_STAN_Q", name="Standard Queen Mattress", price=1045.0)
result1 = products_df2.first()

assert(expected1 == result1)
print("All tests pass")

### 3. Read with schema as DDL formatted string
Define the schema as a DDL formatted string of column names and data types

In [0]:
# TODO
ddl_schema = "item_id string, name string, price double"

products_df3 = (spark
                .read
                .option("sep", ",")
                .option("header", True)
                .schema(ddl_schema)
                .csv(products_csv_path))
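
The DDL string carries the same column names and types as the **`StructType`** from section 2, just in a more compact form. The plain-Python sketch below illustrates that mapping. The `parse_flat_ddl` helper is hypothetical and written only for this example. It is not Spark's DDL parser and handles nothing beyond flat `name type` pairs.

```python
def parse_flat_ddl(ddl):
    """Split a flat 'name type, name type' string into (name, type) pairs.

    Toy parser for illustration only: it does not handle nested types,
    quoted names, comments, or NOT NULL constraints.
    """
    fields = []
    for part in ddl.split(","):
        name, type_name = part.split()
        fields.append((name, type_name.lower()))
    return fields

ddl_schema = "item_id string, name string, price double"
fields = parse_flat_ddl(ddl_schema)
print(fields)

# The names line up with the StructField names defined in section 2.
field_names = [name for name, _ in fields]
print(field_names)
```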

**3.1: CHECK YOUR WORK**

In [0]:
assert(products_df3.count() == 12)
print("All tests pass")

### 4. Write to Delta
Write **`products_df`** to the filepath provided in the variable **`products_output_path`**

In [0]:
# TODO
products_output_path = working_dir + "/delta/products"
(products_df
 .write
 .format("delta")
 .mode("overwrite")
 .save(products_output_path)
)

**4.1: CHECK YOUR WORK**

In [0]:
verify_files = dbutils.fs.ls(products_output_path)
verify_delta_format = False
verify_num_data_files = 0
for f in verify_files:
    if f.name == "_delta_log/":
        verify_delta_format = True
    elif f.name.endswith(".parquet"):
        verify_num_data_files += 1

assert verify_delta_format, "Data not written in Delta format"
assert verify_num_data_files > 0, "No data written"
del verify_files, verify_delta_format, verify_num_data_files
print("All tests pass")

### Clean up classroom

In [0]:
classroom_cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>