# Module 3 - Exercise overview

### Fabric Prerequisites

You need a Lakehouse that is enabled and connected.

Lakehouse paths (replace these strings with your own workspace and Lakehouse names):
- Tables: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Tables`
- Files: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Files`
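
If you would rather build these paths than edit them by hand, a small helper along the following lines could work. This is a minimal sketch: the function name `onelake_path` and its parameters are illustrative assumptions rather than part of any Fabric API, and the URI layout is simply copied from the two example paths above.

```python
def onelake_path(workspace: str, lakehouse: str, area: str = "Tables") -> str:
    """Build an abfss URI that follows the layout of the example paths above."""
    if area not in ("Tables", "Files"):
        raise ValueError("area must be 'Tables' or 'Files'")
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/{area}"
    )


# Example values copied from this module; replace them with your own names.
tables_path = onelake_path("Fabric_2024", "LK_flights", "Tables")
files_path = onelake_path("Fabric_2024", "LK_flights", "Files")
print(tables_path)
print(files_path)
```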

You will also need:
- A PySpark notebook connected to the Fabric standard session


## Exercise 1: Data Movement Using Shortcuts from One Lakehouse to Another

In this exercise, you will move data from one Lakehouse to another using Shortcuts in Microsoft Fabric. A Shortcut lets you reference data in another storage location without duplicating it.

Step-by-Step Instructions:

1) Create Two Lakehouses:
    - In Microsoft Fabric, create two Lakehouses (e.g., Source_Lakehouse and Target_Lakehouse) from the "New" menu.
    - The Source_Lakehouse will contain the data you want to move, and the Target_Lakehouse will be where you place the moved data.

2) Create a Shortcut to the Source Lakehouse:
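    - In the Target_Lakehouse, create a Shortcut under Tables that points to a table in the Source_Lakehouse. Name it `Shortcut_to_Source_Lakehouse` so that it matches the path used in the notebook code. The Shortcut gives you access to the source data without physically copying it.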

3) Move Data Using PySpark:
    - Using a notebook, read the data from the Shortcut and write it into the Target_Lakehouse.

In [None]:
from pyspark.sql import SparkSession

# Get the Spark session (Fabric notebooks already provide `spark`; getOrCreate() reuses it)
spark = SparkSession.builder.appName("DataMovement").getOrCreate()

# Path to the shortcut created in the Target Lakehouse pointing to Source Lakehouse
source_shortcut_path = "Tables/Shortcut_to_Source_Lakehouse"

# Load data from the Source Lakehouse via the shortcut
df = spark.read.format("delta").load(source_shortcut_path)

# Show the data loaded from the Source Lakehouse
df.show()

# Write the data to the Target Lakehouse as a Delta table
target_lakehouse_path = "Tables/Target_Lakehouse_Data"
df.write.format("delta").mode("overwrite").save(target_lakehouse_path)

print("Data successfully moved from Source Lakehouse to Target Lakehouse.")
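
After the write completes, it can be worth confirming that the target holds what the Shortcut exposed. The sketch below is one possible check, not a required part of the exercise. `check_copy` is a hypothetical helper, and it assumes only that both arguments behave like Spark DataFrames, meaning they have a `columns` list and a `count()` method.

```python
def check_copy(source_df, target_df) -> list:
    """Return a list of readable differences; an empty list means the shapes match.

    Assumes both arguments expose `.columns` and `.count()`, as Spark
    DataFrames do. Only the shape is compared, not the row contents.
    """
    problems = []
    if list(source_df.columns) != list(target_df.columns):
        problems.append("column names or order differ")
    source_rows, target_rows = source_df.count(), target_df.count()
    if source_rows != target_rows:
        problems.append(f"row counts differ: {source_rows} vs {target_rows}")
    return problems
```

In the notebook, this could be called as `check_copy(df, spark.read.format("delta").load(target_lakehouse_path))`, and any returned messages printed before moving on.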


## Exercise 2: Creating a Data Pipeline for Extract-Transform-Load (ETL) in Microsoft Fabric 

In this exercise, you will create a data pipeline that extracts data from a source, transforms it, and loads it into target storage (a Lakehouse). You will use Dataflow Gen2 to automate the process and notebooks to perform custom transformations.

Step-by-Step Instructions:

1) Create a Dataflow:
    - In Microsoft Fabric, create a Dataflow Gen2 for the ETL pipeline.
    - Set the data source to an external database, a file, or any other available source (e.g., SQL database, CSV file).

2) Extract Data Using a PySpark Notebook:
    - Use a PySpark notebook to extract the data from the source and load it into the Lakehouse.

3) Transform Data:
    - Perform transformations using the notebook (e.g., filtering, aggregating, and cleaning data).

4) Load Transformed Data into Lakehouse:
    - Finally, load the transformed data into a Delta Table in the Lakehouse.

In [None]:
from pyspark.sql import SparkSession

# Get the Spark session (Fabric notebooks already provide `spark`; getOrCreate() reuses it)
spark = SparkSession.builder.appName("ETL_Pipeline").getOrCreate()

# Step 1: Extract Data (Assuming source is a CSV file in Data Lake)
source_data_path = "Files/source_data.csv"
df = spark.read.csv(source_data_path, header=True, inferSchema=True)

# Step 2: Transform Data (Example: Filter out rows where 'age' is less than 30)
transformed_df = df.filter(df['age'] >= 30)

# Step 3: Load Transformed Data into Lakehouse Delta Table
target_table_path = "Tables/Transformed_Data"
transformed_df.write.format("delta").mode("overwrite").save(target_table_path)

print("ETL pipeline executed successfully. Data loaded into Lakehouse.")
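
The filter in Step 2 depends on the CSV actually containing an `age` column, so a quick check before transforming can give a clearer error than a failed Spark job. The following is a minimal sketch under that assumption. `missing_columns` is a hypothetical helper, and the example header is invented for illustration.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`.

    The comparison is case-sensitive and keeps the order of `required`.
    """
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical header of source_data.csv; in the notebook, pass df.columns instead.
example_columns = ["id", "name", "age"]
print(missing_columns(example_columns, ["age"]))          # []
print(missing_columns(example_columns, ["age", "city"]))  # ['city']
```

In the notebook, the pipeline could stop early when `missing_columns(df.columns, ["age"])` returns a non-empty list.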


## Exercise 3: Data Transformation and Aggregation in a Data Lake Using PySpark

In this exercise, you will work with large datasets in a Data Lake, performing transformations such as aggregation, filtering, and creating summary tables. You'll use PySpark to load, transform, and store data back in the Data Lake for further analysis.

Step-by-Step Instructions:

1) Access Data from the Data Lake:
    - In Microsoft Fabric, connect to your Data Lake (OneLake).
    - Use PySpark to read a dataset stored in the Data Lake.

2) Perform Data Transformations:
    - Apply transformations such as filtering data, performing group-by operations, and calculating summary statistics (e.g., average, sum).

3) Write the Aggregated Data Back to the Data Lake:
    - Save the aggregated/summary data back to the Data Lake in Delta format for further analysis or reporting.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Get the Spark session (Fabric notebooks already provide `spark`; getOrCreate() reuses it)
spark = SparkSession.builder.appName("DataLake_Transformation").getOrCreate()

# Step 1: Load Data from Data Lake
data_lake_path = "Files/Users/data_lake_sales_data"
sales_df = spark.read.format("delta").load(data_lake_path)

# Step 2: Perform Transformations (Group by 'region' and calculate total sales and average sales)
# Column expressions with alias() are used because a dict argument to agg()
# can hold only one aggregate per column name.
aggregated_df = sales_df.groupBy("region").agg(
    F.sum("sales_amount").alias("total_sales"),
    F.avg("sales_amount").alias("average_sales"),
)

# Show the result
aggregated_df.show()

# Step 3: Save the aggregated data back to the Data Lake
output_path = "Files/Users/aggregated_sales_data"  # Change this path to match your Data Lake
aggregated_df.write.format("delta").mode("overwrite").save(output_path)

print("Aggregated data successfully written to Data Lake.")
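
To see what the group-by in Step 2 is expected to produce, the same sum and average can be worked out by hand on a few rows in plain Python. The rows below are made-up sample values, not data from the course. Only the column names `region` and `sales_amount` and the output names `total_sales` and `average_sales` come from the exercise.

```python
from collections import defaultdict

# Made-up sample rows for illustration only.
rows = [
    {"region": "North", "sales_amount": 100.0},
    {"region": "North", "sales_amount": 300.0},
    {"region": "South", "sales_amount": 50.0},
]

# Collect the amounts per region, mirroring groupBy("region").
amounts_by_region = defaultdict(list)
for row in rows:
    amounts_by_region[row["region"]].append(row["sales_amount"])

# Compute the two aggregates for each region.
summary = {
    region: {
        "total_sales": sum(amounts),
        "average_sales": sum(amounts) / len(amounts),
    }
    for region, amounts in amounts_by_region.items()
}
print(summary)
```

For these rows, North has a total of 400.0 and an average of 200.0, and South has 50.0 for both. The aggregated table in the notebook should have the same shape: one row per region with both values.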
