# Module 4 - Exercise overview

### Fabric Prerequisites

You need to have a Lakehouse enabled and connected.

Links to the Lakehouse (replace these strings with your own):
- Tables: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Tables`
- Files: `abfss://Fabric_2024@onelake.dfs.fabric.microsoft.com/LK_flights.Lakehouse/Files`
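Both URIs follow the same pattern, so one way to replace these strings is to build them from your workspace and Lakehouse names. The helper below is a minimal sketch: `onelake_uri` is a hypothetical name, and the pattern is inferred from the two example URIs above rather than taken from an official specification.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_uri(workspace: str, lakehouse: str, area: str, relative_path: str = "") -> str:
    """Build an abfss:// URI shaped like the two examples above (hypothetical helper)."""
    if area not in ("Tables", "Files"):
        raise ValueError("area must be 'Tables' or 'Files'")
    base = f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/{area}"
    relative_path = relative_path.strip("/")
    return f"{base}/{relative_path}" if relative_path else base


# Workspace and Lakehouse names taken from the example links; replace with your own
tables_uri = onelake_uri("Fabric_2024", "LK_flights", "Tables")
files_uri = onelake_uri("Fabric_2024", "LK_flights", "Files", "Users/sales_data.csv")
print(tables_uri)
print(files_uri)
```

A string built this way can typically be passed to `spark.read.csv(...)` in place of a relative `Files/...` path.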

You will also need:
- A PySpark notebook connected to the Fabric standard session

## Exercise 1: Data Analysis and Exploration Using PySpark

In this exercise, you will perform an exploratory data analysis (EDA) using PySpark in a Microsoft Fabric Notebook. You will load a dataset, explore the data, and generate descriptive statistics, visualizations, and insights.

Detailed Instructions:

1) Load Data:
    - In Microsoft Fabric, create a Lakehouse and load a dataset (e.g., a CSV file with sales or customer data).
    - Use PySpark to load the data into a DataFrame.

2) Perform Data Exploration:
    - Use PySpark to explore the data, including checking for missing values, calculating basic statistics, and visualizing relationships between variables.

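```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, when
from pyspark.sql.types import DoubleType, FloatType

# Reuse the notebook's Spark session (getOrCreate returns the existing one)
spark = SparkSession.builder.appName("DataExploration").getOrCreate()

# Load the dataset from the Lakehouse (assumes sales data)
data_path = "Files/Users/sales_data.csv"
df = spark.read.csv(data_path, header=True, inferSchema=True)

# Step 1: Display the first few rows of the dataset
df.show(5)

# Step 2: Count missing values per column.
# NaN can only occur in float/double columns, and isnan() fails on types such as
# dates or timestamps, so it is applied to float/double columns only.
float_cols = {f.name for f in df.schema.fields if isinstance(f.dataType, (DoubleType, FloatType))}


def is_missing(c):
    cond = col(c).isNull()
    return (cond | isnan(col(c))) if c in float_cols else cond


df.select([count(when(is_missing(c), c)).alias(c) for c in df.columns]).show()

# Step 3: Generate descriptive statistics for numeric columns
df.describe().show()

# Step 4: Aggregate sales by region (a simple summary that can then be charted)
df.groupBy("region").sum("sales_amount").show()
```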
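The missing-value count in Step 2 treats two different things as missing: NULL (no value at all) and NaN (a float that is "not a number"). The plain-Python sketch below mirrors that logic on a few made-up rows, with `None` standing in for NULL. The rows and the `count_missing` helper are illustrative assumptions, not part of the exercise dataset.

```python
import math

# Hypothetical rows; None plays the role of a Spark NULL
rows = [
    {"region": "North", "sales_amount": 120.0},
    {"region": None, "sales_amount": float("nan")},
    {"region": "South", "sales_amount": None},
]


def is_missing_value(value) -> bool:
    """True for None (NULL) or a float NaN; everything else counts as present."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows):
    """Return {column name: number of missing values} for a list of dict rows."""
    columns = rows[0].keys()
    return {c: sum(is_missing_value(r[c]) for r in rows) for c in columns}


missing = count_missing(rows)
print(missing)  # {'region': 1, 'sales_amount': 2}
```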


## Exercise 2: Data Preparation for Machine Learning Model

In this exercise, you'll prepare a dataset for machine learning. This involves data cleaning, feature engineering, and splitting the dataset into training and testing sets.

Step-by-Step Instructions:

1) Load and Explore the Data:
    - Load the dataset into a PySpark DataFrame and perform initial exploration.

2) Clean the Data:
    - Handle missing values by imputing or removing them.
    - Convert categorical variables into numerical features using techniques like one-hot encoding.

3) Feature Engineering:
    - Create new features (e.g., time-based features like day of the week, month, etc.).

4) Split Data into Training and Testing Sets:
    - Use PySpark’s built-in functionality to split the data into training and testing sets.
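Note that `randomSplit` treats its weights as sampling probabilities, so an 80/20 split puts roughly, not exactly, 80% of the rows in the training set. The sketch below imitates that idea in plain Python. It illustrates per-row random assignment only and is not Spark's actual implementation; `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1000 DataFrame rows
train, test = random_split(rows)

# Sizes are close to 800/200 but not guaranteed to match exactly
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, and rerunning with the same seed gives the same split.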

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

# Reuse the notebook's Spark session (getOrCreate returns the existing one)
spark = SparkSession.builder.appName("DataPreparation").getOrCreate()

# Load the dataset
data_path = "Files/Users/sales_data.csv"
df = spark.read.csv(data_path, header=True, inferSchema=True)

# Step 1: Clean Data (Handle missing values by removing rows with missing 'sales_amount')
df_clean = df.na.drop(subset=["sales_amount"])

# Step 2: Feature Engineering (Example: Extract year and month from 'date' column)
df_clean = df_clean.withColumn("year", year(col("date"))).withColumn("month", month(col("date")))

# Step 3: One-hot encode the 'region' column
indexer = StringIndexer(inputCol="region", outputCol="region_index")
encoder = OneHotEncoder(inputCol="region_index", outputCol="region_encoded")

# Step 4: Assemble all features into a feature vector
assembler = VectorAssembler(inputCols=["year", "month", "region_encoded"], outputCol="features")

# Step 5: Split the data into training and test sets
train_df, test_df = df_clean.randomSplit([0.8, 0.2], seed=42)

# Pipeline to handle transformations (fitted on the training set only)
pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipeline_model = pipeline.fit(train_df)
train_df_transformed = pipeline_model.transform(train_df)
test_df_transformed = pipeline_model.transform(test_df)

train_df_transformed.select("features", "sales_amount").show(5)
```
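To make the encoding step concrete, the sketch below reproduces in plain Python what `StringIndexer` and `OneHotEncoder` do, assuming their default settings: labels are indexed by descending frequency, and the last index is dropped from the one-hot vector (`dropLast=True`). The region values and helper names are made up for illustration, and Spark returns sparse vectors rather than the plain lists used here.

```python
from collections import Counter

regions = ["North", "South", "North", "East", "North", "South"]  # hypothetical values


def fit_string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


index = fit_string_index(regions)
encoded = {label: one_hot(i, len(index)) for label, i in index.items()}
print(index)    # {'North': 0, 'South': 1, 'East': 2}
print(encoded)  # {'North': [1.0, 0.0], 'South': [0.0, 1.0], 'East': [0.0, 0.0]}
```

Dropping the last category keeps the encoded columns from summing to one for every row, which avoids a redundant feature in linear models.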


## Exercise 3: Build and Train a Machine Learning Model and Run Experiments

In this exercise, you will use the prepared dataset to build a machine learning model (e.g., linear regression) and evaluate its performance. You will also run experiments to tune model hyperparameters.

Step-by-Step Instructions:

1) Build and Train a Machine Learning Model:
    - Use the transformed dataset from the previous exercise to build and train a Linear Regression model using PySpark’s MLlib.

2) Run Experiments:
    - Perform hyperparameter tuning with a parameter grid (`ParamGridBuilder`) and cross-validation (`CrossValidator`) to find the best-performing model.

3) Evaluate the Model:
    - Evaluate the model on the test set using performance metrics like Root Mean Squared Error (RMSE) or R-squared.
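For reference, both evaluation metrics can be computed by hand. The sketch below uses the standard textbook definitions (RMSE is the square root of the mean squared error; R-squared is one minus the ratio of residual to total sum of squares) on made-up numbers. It is meant to show what the metrics measure, not to replace `RegressionEvaluator`.

```python
import math

# Hypothetical actual and predicted sales amounts
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]


def rmse(y_true, y_pred):
    """Root of the mean squared difference between actual and predicted values."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
print(rmse_value)  # 10.0
print(round(r2_value, 3))  # 0.968
```

RMSE is expressed in the same units as the label (here, sales amount), while R-squared is unitless.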

```python
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Reuse the notebook's Spark session (getOrCreate returns the existing one)
spark = SparkSession.builder.appName("ML_Model_Training").getOrCreate()

# Assumes 'train_df_transformed' and 'test_df_transformed' from Exercise 2

# Step 1: Define the Linear Regression estimator (training happens in Step 4)
lr = LinearRegression(featuresCol="features", labelCol="sales_amount")

# Step 2: Set up the parameter grid for hyperparameter tuning
param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 0.5])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build()
)

# Step 3: Cross-Validation setup (RMSE is used to pick the best parameters)
evaluator = RegressionEvaluator(labelCol="sales_amount", metricName="rmse")
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)

# Step 4: Train the model using Cross-Validation
cv_model = crossval.fit(train_df_transformed)

# Step 5: Evaluate the best model on the test set
predictions = cv_model.transform(test_df_transformed)
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-Squared: {r2}")

# Show some sample predictions
predictions.select("features", "sales_amount", "prediction").show(5)
```
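It helps to know how much work the cross-validation above schedules. With 3 values for `regParam`, 3 for `elasticNetParam`, and 3 folds, the grid holds 9 combinations and each one is fitted once per fold, followed by one final refit of the best combination on the full training set. The plain-Python sketch below only enumerates that arithmetic; it does not call Spark.

```python
from itertools import product

# Same values as the parameter grid and fold count used in the exercise
reg_params = [0.01, 0.1, 0.5]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# Every pairing of the two hyperparameters is one grid entry
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

cv_fits = len(grid) * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1         # plus the final refit of the best combination

print(len(grid), cv_fits, total_fits)  # 9 27 28
```

Adding a third hyperparameter with three values would triple these counts, so it is worth keeping grids small while experimenting.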
