# Module 1 - Exercise


## Exercise 1: Navigate the Lakehouse and Read Data

In this exercise, you will connect to a Lakehouse in Microsoft Fabric and perform a basic data read operation using PySpark.

Steps:

1) Create a Lakehouse: In Microsoft Fabric, create a Lakehouse to store structured and unstructured data.
2) Load Data: Ingest data into your Lakehouse (e.g., California Housing Prices). The first cell below writes a small sample table to stand in for it.
3) Use PySpark: Read the data back from the Delta table in your Lakehouse.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import Row


spark = SparkSession.builder.appName("Fabric_Lakehouse_Upload_file").getOrCreate()


data = [
    Row(id=1, name='John Doe', age=100, city='New York'),
    Row(id=2, name='Enrico van de Laar', age=42, city='Utrecht'),
    Row(id=3, name='Tomaz Kastrun', age=45, city='Ljubljana')
]
df = spark.createDataFrame(data)

# Write DataFrame to Delta Table in the Lakehouse
lakehouse_path = "abfss://fabric@onelake.fabric/Lake.Lakehouse/Tables/Sample"  # Modify path according to your Lakehouse structure
df.write.format("delta").mode("overwrite").save(lakehouse_path)

print("Data successfully written to the Delta table in the Lakehouse.")


Data successfully written to the Delta table in the Lakehouse.


In [3]:
# Step 1: Set up the Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Fabric_Lakehouse_Demo").getOrCreate()

# Step 2: Define the Lakehouse table path (Delta Table)
lakehouse_table_path = "abfss://fabric@onelake.fabric/Lake.Lakehouse/Tables/Sample"

# Step 3: Read the data from the Delta table
df = spark.read.format("delta").load(lakehouse_table_path)

# Step 4: Display the first 5 rows of the dataset
df.show(5)



+---+------------------+---+---------+
| id|              name|age|     city|
+---+------------------+---+---------+
|  1|          John Doe|100| New York|
|  2|Enrico van de Laar| 42|  Utrecht|
|  3|     Tomaz Kastrun| 45|Ljubljana|
+---+------------------+---+---------+
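
The write and read cells above hard-code the same `abfss://` path. A minimal sketch of a helper that assembles that path from its parts might look like the following. The function name `build_table_path` and its parameters are hypothetical, and the example values are simply the ones used in this exercise, so substitute your own workspace, host, and Lakehouse names.

```python
def build_table_path(workspace: str, host: str, lakehouse: str, table: str) -> str:
    """Assemble an abfss:// path to a table under a Lakehouse's Tables folder.

    Hypothetical helper: the layout mirrors the paths used in this exercise.
    """
    return f"abfss://{workspace}@{host}/{lakehouse}.Lakehouse/Tables/{table}"


# Example values copied from the cells above; replace them with your own.
sample_path = build_table_path("fabric", "onelake.fabric", "Lake", "Sample")
print(sample_path)
```

The resulting string can be passed to `save(...)` when writing and to `load(...)` when reading, so both cells stay in sync if the location changes.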



Also save the results to a file.

In [4]:
# Step 5: Define the file path in Files section
output_path = "Files/sample_data.csv"

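# Step 6: Save DataFrame as a CSV file
# coalesce(1) produces a single part file; mode("overwrite") lets the cell be re-run
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)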
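
Spark writes CSV output as a folder rather than a single file, so `Files/sample_data.csv` ends up as a directory that holds one or more `part-*.csv` files plus marker files such as `_SUCCESS`. The sketch below shows one way to pick the data files out of such a listing. `find_part_files` is a hypothetical helper, and the file names in the example are made up for illustration.

```python
def find_part_files(names):
    """Return the CSV data files from a Spark output folder listing."""
    # Data files start with "part-"; checksum and marker files do not.
    return sorted(n for n in names if n.startswith("part-") and n.endswith(".csv"))


# Illustrative listing only; real part file names contain a generated ID.
listing = ["_SUCCESS", "part-00000-example.csv", ".part-00000-example.csv.crc"]
data_files = find_part_files(listing)
print(data_files)
```

Because the DataFrame was coalesced to one partition before writing, a single part file is expected in the folder.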



## Exercise 2: Create a Simple Data Pipeline

In this exercise, you will create a basic Data Pipeline in Microsoft Fabric using Notebooks and Pipelines. You will extract data from a source, transform it using PySpark, and then load it back into a Lakehouse.

Steps:

1) Create a Data Pipeline: Set up a data pipeline in Microsoft Fabric.
2) Extract: Read data from an external source (e.g., CSV or database) and load it into the Data Lake.
3) Transform: Perform some simple data transformation using PySpark (e.g., filtering and aggregation).
4) Load: Write the transformed data back to your Lakehouse in Delta format.

In [5]:
# Step 1: Read data from a source (e.g., CSV)
source_path = "abfss://fabric@onelake.fabric/Lake.Lakehouse/Files/sample_data.csv"
df = spark.read.csv(source_path, header=True, inferSchema=True)

# Step 2: Perform transformation (e.g., filter rows where 'age' > 90)
filtered_df = df.filter(df["age"] > 90)

# Step 3: Write the transformed data back to the Lakehouse as a Delta table
filtered_df.write.format("delta").mode("overwrite").save("abfss://fabric@onelake.fabric/Lake.Lakehouse/Tables/Sample_filtered")



Query the table

In [7]:
%%sql

SELECT * FROM Sample_filtered


<Spark SQL result set with 1 rows and 4 fields>
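
The single-row result can be sanity-checked without Spark. The following is a plain-Python sketch that mirrors the sample rows from Exercise 1 and applies the same `age > 90` condition. The helper name `filter_by_min_age` is hypothetical and is not part of the pipeline itself.

```python
# Plain-Python mirror of the sample rows written in Exercise 1.
rows = [
    {"id": 1, "name": "John Doe", "age": 100, "city": "New York"},
    {"id": 2, "name": "Enrico van de Laar", "age": 42, "city": "Utrecht"},
    {"id": 3, "name": "Tomaz Kastrun", "age": 45, "city": "Ljubljana"},
]


def filter_by_min_age(records, threshold):
    """Keep records whose age is strictly greater than the threshold."""
    return [r for r in records if r["age"] > threshold]


expected = filter_by_min_age(rows, 90)
print(len(expected), [r["name"] for r in expected])
```

Only the row with `age` 100 passes the filter, which is consistent with the one-row result set reported for `Sample_filtered`.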