# Module 2 - Exercise overview


### Fabric Prerequisites

You need a Lakehouse that is enabled and connected to your notebook.

Lakehouse paths (replace the workspace and Lakehouse names with your own):
- Tables: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Tables`
- Files: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Files`
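Because both paths follow the same pattern, they can be assembled from the workspace and Lakehouse names. The following is a minimal sketch in plain Python; the helper name `onelake_path` and its parameters are hypothetical and only mirror the structure of the two example strings above.

```python
def onelake_path(workspace: str, lakehouse: str, area: str, relative: str = "") -> str:
    """Build an abfss:// URI for a Lakehouse area ("Tables" or "Files").

    Hypothetical helper: it only reproduces the pattern of the example paths.
    """
    if area not in ("Tables", "Files"):
        raise ValueError("area must be 'Tables' or 'Files'")
    base = f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse/{area}"
    relative = relative.strip("/")
    return f"{base}/{relative}" if relative else base


# Replace the workspace and Lakehouse names with your own
tables_root = onelake_path("Fabric_2024", "LK_flights", "Tables")
files_root = onelake_path("Fabric_2024", "LK_flights", "Files")
print(tables_root)
print(files_root)
```

Keeping the names in one place means only two arguments change when you switch to another workspace or Lakehouse.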

You will also need:
- A PySpark notebook connected to a Fabric standard session


## Exercise 1: Creating a Lakehouse and Writing Data to a Delta Table

In this exercise, you'll create a storage resource (Lakehouse) in Microsoft Fabric and store structured data in a Delta Table using PySpark.

### Step-by-Step Instructions:

1) Create a Lakehouse in Microsoft Fabric:
    - Go to your Microsoft Fabric workspace.
    - Create a Lakehouse from the "New" menu. The Lakehouse provides scalable data lake storage.
    - This Lakehouse will serve as your storage layer.

2) Load Data into PySpark DataFrame:
    - In a Microsoft Fabric notebook or PySpark environment, load a dataset into a PySpark DataFrame. In this example, we'll use some sample data (a list of `Row` objects) and convert it to a PySpark DataFrame.

3) Write Data to Delta Table:
    - After creating the DataFrame, save it as a Delta table in your Lakehouse storage.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import Row


# Initialize Spark Session (getOrCreate() reuses the notebook's session if one already exists)
spark = SparkSession.builder.appName("LakehouseExample").getOrCreate()

# Create sample data as a list of Row objects and convert it to a DataFrame
data = [
    Row(id=1, name='John Doe', age=100, city='New York'),
    Row(id=2, name='Enrico van de Laar', age=42, city='Utrecht'),
    Row(id=3, name='Tomaz Kastrun', age=45, city='Ljubljana')
]
df = spark.createDataFrame(data)

# Write the DataFrame in Delta format to the Lakehouse.
# Placeholder path: replace it with a location in your own Lakehouse.
lakehouse_path = "Files/Files/Files"
df.write.format("delta").mode("overwrite").save(lakehouse_path)

print("Data successfully written to the Delta table in the Lakehouse.")
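A quick way to confirm the write is to read the Delta data back and compare row counts. The sketch below assumes it runs in a notebook where `spark` is the active session. The function names are hypothetical; only `spark.read.format("delta").load(path)` and `count()` are PySpark calls.

```python
def delta_row_count(spark, path: str) -> int:
    """Read the Delta data stored at `path` and return its number of rows."""
    return spark.read.format("delta").load(path).count()


def write_looks_complete(spark, path: str, expected_rows: int) -> bool:
    """Return True when the stored row count matches what was written."""
    return delta_row_count(spark, path) == expected_rows


# In the notebook, after the write in this exercise (three sample rows):
# write_looks_complete(spark, "Files/Files/Files", expected_rows=3)
```

A matching count does not prove the contents are identical, but it catches the common case of writing to, or reading from, the wrong path.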


## Exercise 2: Creating a Spark Cluster for Distributed Computation

In this exercise, you will create a compute resource (Spark Cluster) in Microsoft Fabric and perform distributed data processing using PySpark.

### Step-by-Step Instructions:

1) Create a Spark Cluster in Microsoft Fabric:
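    - Go to your Microsoft Fabric workspace. Fabric runs Spark on Spark pools (the starter pool or a custom pool defined in the workspace settings) rather than on a cluster you create by hand; a "Spark Job Definition" from the "New" menu defines a job that runs on such a pool.
    - Set the pool's node size and node count according to the data size and computational needs.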

2) Perform Distributed Computation:
    - Once the Spark Cluster is available, use PySpark to load a dataset and perform distributed computation such as calculating averages or sums.
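To see why an average can be computed in a distributed way, it helps to look at the idea without Spark. The following plain-Python sketch is a simplified model, not Spark's API. Each partition produces partial `(sum, count)` pairs per subject, and a final step merges them. The split of rows into `partition_a` and `partition_b` is hypothetical, using the same sample rows as this exercise.

```python
from collections import defaultdict


def partial_sums(partition):
    """Per-partition step: accumulate [sum, count] for each subject."""
    acc = defaultdict(lambda: [0, 0])
    for _student_id, subject, score in partition:
        acc[subject][0] += score
        acc[subject][1] += 1
    return dict(acc)


def merge_partials(partials):
    """Combine step: add up the partial [sum, count] pairs, then divide."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for subject, (score_sum, count) in partial.items():
            total[subject][0] += score_sum
            total[subject][1] += count
    return {subject: s / n for subject, (s, n) in total.items()}


# Hypothetical split of the sample rows across two workers
partition_a = [(1, 'Math', 85), (2, 'Math', 90), (1, 'Science', 95)]
partition_b = [(3, 'Math', 78), (2, 'Science', 89), (3, 'Science', 92)]

averages = merge_partials([partial_sums(partition_a), partial_sums(partition_b)])
print(averages)
```

Averaging the per-partition averages directly would give the wrong result when partitions hold different numbers of rows, which is why the sketch carries sums and counts instead.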

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("DistributedComputation").getOrCreate()

# Load data into PySpark DataFrame
data = [(1, 'Math', 85),
        (2, 'Math', 90),
        (3, 'Math', 78),
        (1, 'Science', 95),
        (2, 'Science', 89),
        (3, 'Science', 92)]

columns = ['student_id', 'subject', 'score']
df = spark.createDataFrame(data, columns)

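# Perform distributed computation: calculate the average score per subject.
# The default result column is named "avg(score)"; by default Delta rejects
# parentheses in column names, so rename it before writing.
avg_scores = df.groupBy("subject").avg("score").withColumnRenamed("avg(score)", "avg_score")
avg_scores.show()

# Optionally, write the output to a Delta table or storage.
# Placeholder path: use a different location from any existing Delta table, because
# overwriting a table that has a different schema fails unless overwriteSchema is enabled.
output_path = "Files/Files/Files"
avg_scores.write.format("delta").mode("overwrite").save(output_path)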
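Unless column mapping is enabled, Delta rejects column names that contain a space, a tab, a newline, or any of `,;{}()=`. An aggregate such as `.avg("score")` is named `avg(score)` by default, so it is worth checking column names before a write. The helper below is a hypothetical plain-Python sketch of such a check.

```python
# Characters Delta rejects in column names when column mapping is not enabled
INVALID_DELTA_CHARS = set(" ,;{}()\n\t=")


def invalid_column_names(names):
    """Return the column names that contain at least one rejected character."""
    return [name for name in names if any(ch in INVALID_DELTA_CHARS for ch in name)]


print(invalid_column_names(["subject", "avg(score)"]))  # ['avg(score)']
print(invalid_column_names(["subject", "avg_score"]))   # []
```

In the notebook, calling `invalid_column_names(avg_scores.columns)` before the write should return an empty list.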


## Exercise 3: Partitioning Large Datasets in Storage for Optimized Compute Performance

In this exercise, you will store a large dataset with partitioning to improve compute efficiency. Partitioning helps when working with large datasets by splitting them into smaller, more manageable parts.

### Step-by-Step Instructions:

1) Create or Load a Large Dataset:
    - You can create or load a large dataset (for this exercise, we simulate one).
    - In Microsoft Fabric, partition the dataset based on a key (e.g., year, region) to improve query performance.

2) Partition Data in PySpark:
    - Use PySpark to partition the DataFrame based on a column (e.g., region).
    - Write the partitioned data to Delta format in Lakehouse storage.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize Spark Session
spark = SparkSession.builder.appName("PartitionedData").getOrCreate()

# Create a sample dataset with multiple regions (kept small here; add rows to simulate a large one)
data = [
    Row(id=1, name='John Doe', age=28, region='North America'),
    Row(id=2, name='Jane Smith', age=34, region='Europe'),
    Row(id=3, name='Sam Brown', age=22, region='Asia'),
    Row(id=4, name='Linda Green', age=40, region='North America'),
    Row(id=5, name='Tom Harris', age=29, region='Europe')
    # Add more rows to simulate a large dataset...
]

df = spark.createDataFrame(data)

# Partition the dataset by the 'region' column and write to Delta Table
partitioned_path = "Files/Users/partitioned_data" # Modify path according to your Lakehouse structure
df.write.format("delta").partitionBy("region").mode("overwrite").save(partitioned_path)

print("Data successfully written to Lakehouse with partitioning by region.")
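The benefit of partitioning comes from skipping data that a query does not need. The plain-Python sketch below is a simplified model of that idea, not the Delta storage format. Rows are grouped into one bucket per `region` value, named in the `key=value` style that partitioned folders conventionally use, and a filter on `region` reads a single bucket. The function names are hypothetical.

```python
from collections import defaultdict

rows = [
    {"id": 1, "name": "John Doe", "age": 28, "region": "North America"},
    {"id": 2, "name": "Jane Smith", "age": 34, "region": "Europe"},
    {"id": 3, "name": "Sam Brown", "age": 22, "region": "Asia"},
    {"id": 4, "name": "Linda Green", "age": 40, "region": "North America"},
    {"id": 5, "name": "Tom Harris", "age": 29, "region": "Europe"},
]


def partition_rows(rows, key):
    """Group rows into one bucket per distinct value of `key`."""
    buckets = defaultdict(list)
    for row in rows:
        # The partition value is carried by the bucket name, so drop it from the row
        stored = {k: v for k, v in row.items() if k != key}
        buckets[f"{key}={row[key]}"].append(stored)
    return dict(buckets)


def read_with_pruning(buckets, key, value):
    """Read only the bucket that matches the filter; the others are never touched."""
    return buckets.get(f"{key}={value}", [])


layout = partition_rows(rows, "region")
print(sorted(layout))

europe = read_with_pruning(layout, "region", "Europe")
print([r["name"] for r in europe])
```

The same reasoning explains why a partition column should have relatively few distinct values: each value becomes its own bucket, and many tiny buckets cost more to manage than they save.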
